Data Mining is an essential component of Big Data technologies and large-scale data analysis techniques. It underpins Big Data analytics and predictive analysis. Discover the complete definition of Data Mining.
Data mining definition
Forage de données, exploration de données or fouille de données: these are the possible French translations of data mining. As a general rule, the term Data Mining refers to analyzing data from different perspectives and transforming it into useful information by establishing relationships between the data or identifying patterns. This information can then be used by companies to increase turnover or reduce costs. It can also be used to better understand a customer base in order to establish better marketing strategies.
What is data mining?
Data Mining software is one of the analytical tools used for data analysis. It allows users to analyze data from different angles, categorize it, and summarize the relationships identified. Technically, Data Mining is the process of finding correlations or patterns among the many fields of large relational databases.
Data Mining relies on complex, sophisticated algorithms to segment data and assess future probabilities. Data Mining is also known as Knowledge Discovery in Data.
A natural technological evolution
The term Data Mining is relatively new, but the technology is not. For years, companies have been using powerful computers to process the large volumes of data accumulated by supermarket scanners and to analyze market research reports. Meanwhile, continuous innovation in computing power, storage, and statistical software keeps increasing the accuracy of analysis while driving down its cost.
Data, Information and Knowledge in Data Mining
Data are facts, numbers, or text that can be processed by a computer. Today, companies accumulate vast amounts of data in a variety of formats and volumes. Among these data, we can distinguish:
- Operational or transactional data, such as sales, cost, inventory, receipt or accounting data.
- Non-operational data, such as industrial sales, forecast data, and macro-economic data.
- Metadata: data about the data itself, such as definitions in a data dictionary.
The patterns, associations and relationships among all these data yield information. For example, analyzing the transaction data from a point of sale provides information about which products are being sold and when.
This information can in turn be converted into knowledge about historical patterns or future trends. For example, information on retail sales in a supermarket can be analyzed in the light of promotional efforts to gain knowledge about shopper behaviour. A producer or retailer can thus use data mining to determine which products should be promoted.
What is a data warehouse?
Significant advances in data collection, computing power, data transmission, and storage capabilities enable companies to integrate their databases into Data Warehouses. Data Warehousing is the process of centralizing the management and retrieval of data.
Through a data warehouse, companies can break data down into specific user segments in order to analyse them in detail. Analysts can also start with the type of data they want to use and then create a warehouse from this data.
Like Data Mining, the term Data Warehousing is relatively new, while the concept itself has been around for years. Data Warehousing represents an ideal vision of a permanently maintained central data repository. This centralization is necessary to maximize user access and analysis.
Thanks to great technological advances, this utopian vision has become a reality for many companies. Similarly, advances in analytical software allow users to access this data freely. It is on this analytical software that Data Mining is based.
Data Mining Methods
There are five varieties of data mining:
- Association – look for patterns in which one event is linked to another event.
- Sequence analysis – look for patterns in which one event leads to another later event.
- Classification – search for new patterns, even if it means changing the way the data is organized.
- Clustering – find and visually document previously unknown groups of facts.
- Prediction – discover data patterns that can lead to reasonable predictions about the future. This type of data mining is also known as predictive analysis.
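To make sequence analysis more concrete, here is a minimal Python sketch that counts how many customers bought one item and later bought another. The purchase histories and the function name `sequence_counts` are hypothetical, invented for this illustration; dedicated tools rely on more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations


def sequence_counts(histories):
    """Count, per customer, each ordered pair (earlier item, later item)."""
    counts = Counter()
    for events in histories:
        seen_pairs = set()
        # combinations() keeps the input order, so `first` always precedes `later`.
        for first, later in combinations(events, 2):
            if first != later:
                seen_pairs.add((first, later))
        # Each customer contributes at most once to a given pair.
        counts.update(seen_pairs)
    return counts


# Hypothetical data: one chronologically ordered purchase list per customer.
histories = [
    ["printer", "paper", "ink"],
    ["paper", "printer", "ink"],
    ["ink", "paper"],
]

counts = sequence_counts(histories)
print(counts[("printer", "ink")])  # customers who bought a printer, then ink
```

A pair that shows up for many customers in one direction but rarely in the other suggests that the first event tends to lead to the second.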
What is the use of Data Mining in marketing?
Data Mining is currently used mainly by consumer-focused companies in the retail, finance, communication, or marketing sectors. Data Mining techniques are also used in various research fields, such as mathematics, cybernetics or genetics. Web Mining, used in the field of customer relationship management, aims to identify patterns of user behaviour within the vast amounts of data collected by a website.
Thanks to Data Mining, companies can determine the relationships between internal factors, such as prices, product positioning and employee skills, and external factors, such as economic indicators, competition, or consumer demographics.
They can then determine the impact of these relationships on sales, consumer satisfaction, and company profits. Finally, they can drill down from this summary information to the detailed transactional data.
With data mining, a retailer can use in-store customer purchase records to send targeted promotions based on an individual’s purchase history. By mining the demographic data from comment or warranty cards, the seller can develop products and promotions that appeal to specific consumer segments.
Concrete examples of the use of Data Mining
For example, a Midwestern grocery store chain used Oracle data mining software to analyze local purchasing patterns. The chain discovered that when men buy diapers on Thursdays and Saturdays, they also tend to buy beer. A more thorough analysis showed that these customers usually do their weekly shopping on Saturdays, while on Thursdays they only buy a few items. The chain concluded that these customers buy their beer to have it on hand for the weekend.
This newly discovered information could be used in different ways to increase sales. For example, the beer section was moved closer to the diaper section. The retailer also made sure that beer and diapers would no longer be sold out on Thursdays.
In another example, Blockbuster Entertainment mines its historical video rental database to recommend movies to individual customers. Similarly, American Express may suggest products to its customers based on their monthly expenses.
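The strength of an association like the one between diapers and beer is commonly expressed with three standard measures: support, confidence and lift. The sketch below computes them on a handful of made-up baskets. The data and the function name `rule_metrics` are assumptions for illustration only and do not come from the grocery chain.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    with_a = sum(1 for b in baskets if antecedent in b)
    with_c = sum(1 for b in baskets if consequent in b)
    with_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    if with_a == 0 or with_c == 0:
        raise ValueError("both items must appear in at least one basket")
    support = with_both / n            # share of all baskets containing both items
    confidence = with_both / with_a    # share of antecedent baskets that also hold the consequent
    lift = confidence / (with_c / n)   # > 1 means the items co-occur more than chance would suggest
    return support, confidence, lift


# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

On real data, a retailer would typically keep only the rules whose support and confidence exceed chosen thresholds before acting on them.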
The retail giant WalMart is pioneering massive data mining to transform its relationships with suppliers. WalMart collects transactional data from 2,900 stores in 6 different countries, and transmits this data continuously to its 7.5 terabyte Data Warehouse provided by Teradata. More than 3,500 WalMart suppliers can access the data on their products and perform analyses. These suppliers use the data to identify customer buying patterns at the store level. They use the information to manage local store inventories and identify new opportunities. In 1995, WalMart computers handled nearly a million complex data requests.
The National Basketball Association (NBA) is exploring the use of data mining in conjunction with image recordings of basketball games. The Advanced Scout software analyzes players’ movements to help coaches orchestrate strategies. For example, an analysis of the game between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that John Williams scored four baskets when Mark Price was on defense. Advanced Scout can detect this pattern, as well as how much it differs from the Cavaliers’ average shooting percentage during the game, which was 49.30%. Using the NBA World Clock, a coach can automatically view video clips of every shot Williams made while Price was on defense, without having to watch hours of footage.
Enabling consumers to control their digital footprint
In the age of digital technology, social networks and connected devices, marketers are constantly collecting massive amounts of data in real time. Companies monitor what consumers post, like and share on social networks, the devices they use, the credit cards they spend with, the cities in which they are located, and the cities in which they live. For good reason: this data allows them to promote and sell products in a personalized way.
Many companies are now developing their own marketing clouds to collect information about their target customers. As a result, businesses and governments can easily use personal data for their own purposes without asking for consent.
To remedy this problem and allow consumers to control their data, the startup Digi.me was founded in 2009. It provides consumers with the tools to take back their digital footprint and to collect and share information directly with businesses on their own terms. Digi.me is a leader in the “Internet of Me”. Once users take control of their data, they can put a price on it and put up barriers to prevent anyone from accessing it without permission. Without control over their personal data, consumers are simply exploited without knowing it.
The technology developed by Digi.me allows users to download their data and store it on the internet. The data is natively stored on an individual device, preventing third parties from accessing it. The startup has raised 10.6 million, 7 million of which was raised in 2016. It also partners with Toshiba and Lenovo, and collaborates with leaders in the health insurance, finance and pharmaceutical industries.
Preventing Tax Evasion with Data Mining
In India, the government is determined to use data mining to prevent tax evasion, a scourge that deeply affects the country. To remedy it, the tax department will use technology to make it easier for honest citizens to pay their taxes, and harder for dishonest ones to evade them. It is not yet known how data mining will be used, but more details are expected to be released in the coming months.
Recruiting the best employees
Recruitment professionals are increasingly using data mining tools to locate and identify the most promising candidates for their company. In Ireland, for example, companies collect candidate data online to find the best talent. The data can be used, for instance, to estimate a candidate’s level of productivity and satisfaction. This is why LinkedIn chose to build a new building to expand its Irish hub, which acts as its European HQ. 200 new employees were added to a team of 1000 people.
How does Data Mining work?
Information technology has evolved to separate transactional and analytical systems. Data Mining provides the link between the two. Data Mining software analyzes the relationships and patterns in stored transaction data based on user queries. Several types of analytical software are available: statistics, Machine Learning, and Neural Networks. In general, four types of relationships are sought:
- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain may mine customer purchase data to determine when customer visits occur and what their usual orders are. This information can be used to increase traffic by offering daily menus.
- Clusters: Data are grouped according to logical relationships or customer preferences. For example, data can be mined to identify market segments or customer affinities.
- Associations: Data can be mined to identify associations. The diapers and beer case cited above is an example of associative mining.
- Sequential Patterns: Data is mined to anticipate behaviour patterns and trends. For example, an outdoor equipment retailer may predict the likelihood of a backpack being purchased based on a customer’s purchases of sleeping bags and hiking boots.
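To illustrate the idea of clusters, the sketch below groups customers by average basket value using a very small one-dimensional version of the k-means algorithm. The spending figures, the starting centers and the function name `kmeans_1d` are all hypothetical; this is a teaching sketch, not a production clustering routine.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group mean."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical average basket value per customer.
spend = [12, 15, 14, 80, 85, 90]
centers, groups = kmeans_1d(spend, centers=[0, 100])
print(centers)
print(groups)
```

Here the two groups that emerge could be read as a "small basket" segment and a "large basket" segment, which is the kind of market segmentation the Clusters item describes.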
Data Mining is based on five major elements:
- Extracting, transforming, and loading transactional data into the Data Warehouse system.
- Storing and managing the data in a multi-dimensional database system.
- Providing data access to business analysts and IT professionals.
- Analyzing the data using application software.
- Presenting the data in a useful format, such as a graph or table.
Different levels of analysis are available:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: These tree-like structures represent sets of decisions. These decisions generate rules for classifying a dataset. Specific methods for decision trees include Classification and Regression Trees (CART), and Chi Square Automatic Interaction Detection (CHAID). Both of these methods are used to classify a dataset. They provide a set of rules that can be applied to a new dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID segments the dataset by using chi square tests to create multi-way splits. In general, CART requires less data preparation than CHAID.
- The nearest neighbor method: This technique classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset.
- Rule induction: Extraction of “if-then” rules from the data, based on statistical significance.
- Data visualization: Visual interpretation of complex relationships in multidimensional data. Graphical tools are used to illustrate data relationships.
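Of the techniques listed above, the nearest neighbor method is the simplest to sketch in a few lines. The Python example below classifies a new record by majority vote among its k closest records in a historical dataset. The customer data, the segment labels and the function name `knn_classify` are hypothetical, and the example assumes Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter


def knn_classify(history, record, k=3):
    """Label `record` with the majority class of its k nearest historical records."""
    by_distance = sorted(history, key=lambda item: math.dist(item[0], record))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical history: (age, visits per year) -> customer segment.
history = [
    ((25, 40), "frequent"),
    ((30, 35), "frequent"),
    ((28, 45), "frequent"),
    ((55, 5), "occasional"),
    ((60, 8), "occasional"),
    ((52, 3), "occasional"),
]

print(knn_classify(history, (27, 38)))
print(knn_classify(history, (58, 6)))
```

In practice the features would usually be rescaled first so that no single attribute dominates the distance; that step is left out here to keep the sketch short.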
The Data Mining Process in 5 steps
The Data Mining Process can be broken down into 5 steps. First, companies collect the data and load it into their Data Warehouses. Then they store and manage the data, either on physical servers or in the Cloud. Business analysts, management teams and IT professionals access this data and determine how they want to organize it. The application software then sorts the data based on user results. Finally, the end user presents the data in an easy-to-share format such as a graph or table.
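These five steps can be sketched end to end with Python's standard library. In this minimal example an in-memory SQLite table stands in for the Data Warehouse, and the CSV content, table name and function name `run_pipeline` are all assumptions made for illustration; a real deployment would involve far larger systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw transactional data, as it might arrive from the tills.
RAW_CSV = """store,product,amount
north,beer,4.50
north,diapers,12.00
south,beer,4.50
south,beer,9.00
"""


def run_pipeline(raw_csv):
    # Step 1: collect the data and convert the amounts to numbers.
    rows = [(r["store"], r["product"], float(r["amount"]))
            for r in csv.DictReader(io.StringIO(raw_csv))]
    # Step 2: store it; an in-memory SQLite table plays the role of the warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # Steps 3 and 4: access the data and sort it into totals per product.
    totals = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    ).fetchall()
    conn.close()
    return totals


totals = run_pipeline(RAW_CSV)
# Step 5: present the result as a small text table.
for product, total in totals:
    print(f"{product:<10}{total:>8.2f}")
```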
The 3 main properties of Data Mining
Data Mining has 3 main properties:
- Automatic pattern discovery
Data Mining is based on the development of models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of Data Mining models. Data Mining models can be used to mine the data on which they are built, but most types of models can be generalized to new data. The process of applying a model to new data is called scoring.
- Predicting probable outcomes
Many forms of data mining are predictive. For example, a model can predict an outcome based on education and other demographic factors. Predictions have an associated probability. Some forms of predictive data mining generate rules, which are the conditions for obtaining an outcome. For example, a rule may specify that a person with a bachelor’s degree who lives in a specific neighborhood is likely to earn more than the regional average.
- The creation of usable information
Data Mining makes it possible to extract usable information from large volumes of data. For example, an urban planner can use a model that predicts income based on demographic data to develop a plan for low-income households. A car rental agency can use a model that identifies consumer segments to create a promotion targeting high-value customers.
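The notions of building a model, attaching a probability to a rule, and scoring new data can be illustrated together. In the sketch below, a model is built by estimating how often an outcome holds among the training records that satisfy a rule's condition, and it is then applied to records it has never seen. The records, the regional average, and the function names `build_model` and `score` are hypothetical and chosen only for this illustration.

```python
def build_model(records, condition, outcome):
    """Estimate how often `outcome` holds among records that satisfy `condition`."""
    matching = [r for r in records if condition(r)]
    if not matching:
        raise ValueError("no training record satisfies the condition")
    probability = sum(1 for r in matching if outcome(r)) / len(matching)
    return {"condition": condition, "probability": probability}


def score(model, new_records):
    """Apply the model to unseen records: the rule's probability where it fires, else None."""
    return [model["probability"] if model["condition"](r) else None
            for r in new_records]


# Hypothetical training data.
training = [
    {"degree": "bachelor", "area": "A", "salary": 52000},
    {"degree": "bachelor", "area": "A", "salary": 48000},
    {"degree": "bachelor", "area": "A", "salary": 39000},
    {"degree": "bachelor", "area": "B", "salary": 41000},
    {"degree": "none", "area": "A", "salary": 30000},
    {"degree": "none", "area": "B", "salary": 30000},
]
regional_average = sum(r["salary"] for r in training) / len(training)

model = build_model(
    training,
    condition=lambda r: r["degree"] == "bachelor" and r["area"] == "A",
    outcome=lambda r: r["salary"] > regional_average,
)

# Scoring: the new records carry no salary, the model supplies a probability instead.
new_people = [{"degree": "bachelor", "area": "A"}, {"degree": "none", "area": "A"}]
scores = score(model, new_people)
print(scores)
```

Returning `None` when the rule does not apply is a design choice made here for clarity; a fuller model would combine several rules so that every record receives a prediction.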
What technology infrastructure is required?
Today, Data Mining applications are available in all sizes for mainframes, servers and PCs. System prices range from several thousand dollars for the smallest applications to 1 million dollars per terabyte for the largest ones. Enterprise applications typically range from 10 gigabytes to more than 11 terabytes. NCR has the capacity to deliver applications of more than 100 terabytes. There are two main technology drivers:
- The size of the database: The more data to be processed and maintained, the more powerful the system required.
- The complexity of the requests: The more complex and numerous the requests, the more powerful the system required.
Relational database storage and management technologies are suitable for many data mining applications below 50 gigabytes. However, this infrastructure needs to be greatly expanded to support larger applications. Some vendors have added greater indexing capabilities to increase query performance. Others use new hardware architectures such as Massively Parallel Processors (MPPs) to improve query processing time. For example, NCR’s MPP systems link hundreds of Pentium processors to achieve higher levels of performance than the best supercomputers.
Data Mining Software
Data Mining software analyzes relationships between data and identifies patterns based on user queries. For example, software can be used to create classes of information. A restaurant, for instance, can use Data Mining to determine when to offer certain items. It will then need to search through the collected information and create classes based on when customers visit and what they order.
In other cases, data miners find clusters of information based on logical relationships, or they look for associations and sequential patterns to draw conclusions about user behavior. Data Mining software is available for this purpose. Orange, Weka, RapidMiner and Tanagra are some of the open source tools available on the Web. Professional licenses for Data Mining are also available. Among the best known are SPSS, distributed by IBM, Enterprise Miner from SAS, and Microsoft Analysis Services from the Redmond firm.
Data Mining courses
Many universities specializing in computer science and mathematics are exploring this probability-based technique. Data mining courses and MOOCs are readily available on the web for anyone who wants to understand and explore in more detail the possibilities of this science associated with Big Data. There are many Data Mining courses in PDF format that you can download. Be aware that the level varies according to the type of teaching. For our part, we recommend the work of Stéphane Tufféry, President of the Scientific Committee of the CESP of the University of Rennes 1. A specialist in this field, he has even written a book on the subject.