Extracting value from data is essentially the foundation of data science. That value can take many forms depending on the context, but it usually comes in the form of better decisions.
In the 21st century, data has become a source of power that organizations and individuals of every kind seek to collect, control and capitalize upon. The Oxford Dictionary defines data as "facts and statistics collected together for reference and analysis". The possibilities that come with the growing quantity of available data are seemingly endless: although data is not inherently valuable like gold or silver, it can lead to informed business decisions, which, in turn, are valuable. Collections of raw data are classified by the form in which they are presented, as well as by the data types of their underlying variables.
Data mining is a term that has been used so loosely that it is beginning to lose its meaning. The problem is partly due to the complexity and breadth of the activities that fall under the data mining umbrella. This loose usage has caused considerable confusion and misinterpretation, particularly among people without a scientific background, who often refer to data mining without a genuine understanding of what it entails. Data mining is an extension of traditional data analysis and statistical approaches. It incorporates analytical techniques drawn from a range of disciplines, including statistics, visualization, pattern recognition, and areas of artificial intelligence such as neural networks and machine learning.
Knowledge discovery in databases (KDD), a phrase proposed by Gregory Piatetsky-Shapiro in 1989, has a much broader interpretation than data mining. It describes the entire process of turning unprocessed raw data into information that can readily be used in decision making. Fayyad et al. classified the steps in KDD into five segments, which are shown in the figure below. KDD includes not only the processing of data, but also the selection and storage of data and the human interpretation of the results acquired during the process. Data mining, by contrast, covers only the processing of data through largely automatic means, for example programming methods and data mining tools. Data mining shares many similarities with the field of statistics.
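To make the idea of KDD as a multi-step process more concrete, here is a minimal Python sketch that chains five small functions together. It is only an illustration under stated assumptions: the step names (selection, preprocessing, transformation, data mining, interpretation) follow the way the Fayyad et al. breakdown is commonly summarised, the function names and the `RAW` records are hypothetical, and the "mining" step is just a per-group average standing in for a real technique.

```python
from statistics import mean

# Hypothetical raw records from an operational system (invented values).
RAW = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},  # missing value
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "south", "spend": 100.0},
    {"customer": "e", "region": "east", "spend": 300.0},
]


def select(records, fields):
    """Selection: keep only the fields relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records):
    """Preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: group the spend values by region."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["region"], []).append(r["spend"])
    return grouped


def mine(grouped):
    """Data mining (a stand-in): compute the mean spend per region."""
    return {region: mean(values) for region, values in grouped.items()}


def interpret(pattern):
    """Interpretation: rank regions so a person can act on the result."""
    return sorted(pattern.items(), key=lambda item: item[1], reverse=True)


def run_kdd(records):
    """Run the five illustrative steps in order."""
    selected = select(records, ["region", "spend"])
    cleaned = preprocess(selected)
    grouped = transform(cleaned)
    pattern = mine(grouped)
    return interpret(pattern)


if __name__ == "__main__":
    for region, average in run_kdd(RAW):
        print(f"{region}: {average:.1f}")
```

Only `mine` corresponds to data mining in the narrow sense used here. The other four functions are the surrounding KDD work of choosing, cleaning, reshaping and interpreting the data.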
Statistical tools and methods are used within the data mining process, but the two fields differ in several ways. Data mining is typically conducted on large volumes of data, whereas statistics generally makes use of smaller samples. Furthermore, data used in statistics are mainly two-dimensional, whereas data mining techniques can analyze many forms of data, even data with high dimensionality. Data mining is generally performed on data that has already been collected for other purposes (a by-product of an operational system). As such, data mining tends to rely on incomplete data more than statistics does, because in statistics the data collection process is more controlled.
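Because data gathered as a by-product of an operational system is often incomplete, a sensible first check is to measure how much of each field is actually missing. The sketch below is one simple way to do that in plain Python. The `RECORDS` list and the `missing_share` helper are hypothetical and exist only for illustration, and the sketch assumes that a missing value is represented as `None`.

```python
# Hypothetical records collected as a by-product of an operational system
# (all values are invented for illustration).
RECORDS = [
    {"order_id": 1, "age": 34, "postcode": "AB1"},
    {"order_id": 2, "age": None, "postcode": "CD2"},
    {"order_id": 3, "age": None, "postcode": None},
    {"order_id": 4, "age": 51, "postcode": "EF3"},
]


def missing_share(records):
    """Return the fraction of missing (None or absent) values per field."""
    if not records:
        return {}
    fields = sorted({field for record in records for field in record})
    total = len(records)
    return {
        field: sum(1 for record in records if record.get(field) is None) / total
        for field in fields
    }


if __name__ == "__main__":
    for field, share in missing_share(RECORDS).items():
        print(f"{field}: {share:.0%} missing")
```

A field with a high share of missing values may need to be imputed, dropped or treated with caution before any modelling is attempted.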
It has been said that data mining results are inherently fuzzy or soft, because the data is sometimes incomplete and inexact. Despite the differences between the two fields, statistical methods and procedures are important in data mining, particularly when it comes to developing and assessing models. The statistical aspect of data mining falls under the data transformation and data mining stages of the KDD process.
Another important aspect of data mining is data visualisation. Data visualisation (which is what we love here at The Data School) aims to represent large-scale data collections visually to aid the understanding and analysis of information. It seeks to present the discovered relationships and patterns in various forms in order to engage the information-processing abilities of human analysts. Data mining and data visualisation are interrelated, because it is important to identify and understand the deeper relationships discovered during the data mining process. Visualisation also allows data to be explored before it is modelled in whatever way is necessary. Data visualisation forms part of the evaluation and presentation stage of the KDD process. The relationship between data mining and data visualisation is shown below.
Tableau and Alteryx have allowed us to exploit data mining and visualisation with ease, and the rest of the blog posts in this series will aim to show you how!
Fayyad U, Piatetsky-Shapiro G & Smyth P, 1996, From data mining to knowledge discovery in databases, AI Magazine, 17(3), pp. 37.
Calders T & Custers B, 2013, What is data mining and how does it work?, Springer-Verlag Berlin Heidelberg.
Furnas A, 2012, Everything you wanted to know about data mining but were afraid to ask, [Online], [Cited May 2019], Available from https://www.theatlantic.com/technology/archive/2012/04/everything-you-wanted-to-know-about-data-mining-but-were-afraid-to-ask/255388/.
Jackson J, 2002, Data mining: A conceptual overview, Communications of the Association for Information Systems, 8(1), pp. 19.
Kazmaier J, 2017, A machine learning data analysis decision-support system, BEng (Industrial) Final year project, University of Stellenbosch.
Oxford University Press, 2019, Definition of data in English, [Online], [Cited April 2017], Available from https://en.oxforddictionaries.com/definition/data.