

By Chris Kekaha TUI Online

As technological advances let us access and examine more data than ever before, the sheer volume of that data will soon outstrip our ability to mine it effectively. We can now collect from so many different sources that some of our data will inevitably be unusable, if not useless. Mining the data for previously undetected trends and patterns requires some insight into the connections between different types of data and different categories of knowledge. While some kinds of connections and patterns can be found by machine- or process-driven methods, the software cannot be programmed to think too far outside the box, or to speculate about how some data might affect other data. Human analysts work at human speed, but once they enter their queries into the software, the data is mined at computer speed.

But what happens to the data after it is mined? It continues to reside in the storage unit it has always been in, waiting to be mined again for some other parameter or connection. We can't simply delete it, as it may prove useful to us again and again in the future. We can back it up and compress our archives, but meanwhile we are still compiling more data across multiple information streams every day. The best option is to warehouse our data, and to keep or create separate databases for the purposes of mining.

As our warehouse databases fill up, the sheer bulk of the data they hold becomes prohibitive for even the most robust processing, and mining the entire database, even with our strongest system, stops being worth the cost and time. For this reason, it is recommended that when we need to mine data for suspected patterns or to verify established institutional knowledge, we export only the data that we need into smaller, more manageable databases. This allows our data mining software to perform as designed, and ensures (assuming we made the right choices) that we are not wasting our resources on data that is irrelevant to our purposes. In other words, it ensures that we are mining only the correct, or to be more accurate, the appropriate data.

When our mining operations are concluded, these databases could be deleted so they do not weigh down our systems. The model or structure of these ad hoc mining databases could then be saved, so that the same appropriate data can be called up again at a later date with current, up-to-date information. On the other hand, the time and effort needed to extract the data for mining, and to design and build the structure of the mining database, may make it more cost-effective to keep the mining database for later use, as the care and feeding costs will be significantly lower than a ground-up rebuild every time we want to plumb the same mines.

By keeping our mining databases a manageable size, we can stem the overflow of the massive amount of data we are inundated with daily. Of course, the more data we have access to, the greater the chance that we can extract actionable intelligence from it, so there is no good argument for storing less data. The answer is simply to control and manage the data that we do store. As automated ETL (extract, transform, load) technology improves and it becomes easier to extract our ad hoc databases for mining, it becomes more cost-efficient to simply keep them on a temporary basis. This helps us to manage the amount of data on hand, and to keep the data miners from overwhelming the organization.
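The idea of exporting only the relevant data into a small mining database, and saving the extract definition so it can be rerun later against current data, can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module with in-memory databases. The table names (sales, sales_extract), the columns, the filter values, and the functions build_mining_db and demo are all hypothetical and chosen only for the example; a real warehouse and ETL tool would look different.

```python
import sqlite3

# Hypothetical saved "model" of the ad hoc mining database: which warehouse
# rows and columns one particular mining question needs.
EXTRACT_SQL = """
    SELECT customer_id, region, amount, sold_on
    FROM sales
    WHERE sold_on >= ? AND region = ?
"""


def build_mining_db(warehouse, mining, since, region):
    """Rebuild the small mining table from the warehouse and return its row count."""
    # Dropping and recreating the table keeps the mining database disposable.
    mining.execute("DROP TABLE IF EXISTS sales_extract")
    mining.execute(
        "CREATE TABLE sales_extract "
        "(customer_id INTEGER, region TEXT, amount REAL, sold_on TEXT)"
    )
    rows = warehouse.execute(EXTRACT_SQL, (since, region)).fetchall()
    mining.executemany("INSERT INTO sales_extract VALUES (?, ?, ?, ?)", rows)
    mining.commit()
    return len(rows)


def demo():
    # Stand-in warehouse holding more rows and columns than the mining job needs.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE sales "
        "(customer_id INTEGER, region TEXT, amount REAL, sold_on TEXT, notes TEXT)"
    )
    warehouse.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
        [
            (1, "west", 120.0, "2011-01-15", "a"),
            (2, "east", 80.0, "2011-02-01", "b"),
            (3, "west", 45.5, "2010-11-30", "c"),
            (4, "west", 200.0, "2011-03-10", "d"),
        ],
    )
    mining = sqlite3.connect(":memory:")

    # First extract: only 2011 sales in the west region are copied.
    first_count = build_mining_db(warehouse, mining, "2011-01-01", "west")

    # New data arrives in the warehouse; rerunning the saved extract
    # refreshes the mining database with current information.
    warehouse.execute(
        "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
        (5, "west", 30.0, "2011-03-20", "e"),
    )
    second_count = build_mining_db(warehouse, mining, "2011-01-01", "west")

    total = mining.execute("SELECT SUM(amount) FROM sales_extract").fetchone()[0]
    return first_count, second_count, total
```

Because the extract definition is kept separately from the extracted data, the mining database in this sketch can be deleted after use and rebuilt on demand, or kept for a while and refreshed, whichever turns out to be cheaper.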
