100 Top Data Mining Job Interview Questions and Answers

Data Mining Questions with Answers:

1. What is data mining?
Data mining is the analysis step of the knowledge discovery in databases (KDD) process. It is a relatively young, interdisciplinary field of computer science concerned with discovering new patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The goal of data mining is to extract knowledge from a data set in a human-understandable structure. Beyond the analysis itself, it involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of the structures found, visualization and online updating.

2. Differentiate between data mining and data warehousing.
Data warehousing is the process of extracting data from different sources, cleaning it and storing it in a warehouse. Data mining, in contrast, aims to examine or explore that data, typically through queries that can be run against the data warehouse. Exploring the data in this way helps with reporting, planning strategies, finding meaningful patterns, etc.
E.g. a company's data warehouse stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.

3. What is data purging?
Data purging is the process of cleaning junk data out of a database, for example by getting rid of unnecessary NULL values in columns. It is usually done when the database has grown too large.

4. What are cubes?
A data cube stores data in a summarized form, which allows faster analysis and makes reporting easy.
E.g. using a data cube, a user may want to analyze the weekly and monthly performance of an employee. Here, month and week can be considered dimensions of the cube.
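
The summarizing idea can be sketched in a few lines of Python. This is only an illustration, not how any particular OLAP product stores cubes; the `build_cube` helper, the column names and the fact rows are all made up for the example.

```python
from collections import defaultdict


def build_cube(facts, dimensions, measure):
    """Sum `measure` for every combination of values of the given dimensions."""
    cube = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        cube[key] += fact[measure]
    return dict(cube)


# Hypothetical fact rows: tasks completed per employee per week.
facts = [
    {"employee": "ana", "month": "Jan", "week": 1, "tasks": 5},
    {"employee": "ana", "month": "Jan", "week": 2, "tasks": 7},
    {"employee": "ana", "month": "Feb", "week": 5, "tasks": 6},
    {"employee": "ben", "month": "Jan", "week": 1, "tasks": 4},
]

# Pre-aggregated views along the month and week dimensions.
by_month = build_cube(facts, ["employee", "month"], "tasks")
by_week = build_cube(facts, ["employee", "week"], "tasks")
```

Because the totals are computed once and stored, a report on monthly performance only has to look up `by_month[("ana", "Jan")]` instead of scanning every fact row again.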

5. What are the different problems that data mining can solve?

  • Data mining helps analysts in making faster business decisions which increases revenue with lower costs.
  • Data mining helps to understand, explore and identify patterns of data.
  • Data mining automates the process of finding predictive information in large databases.
  • Helps to identify previously hidden patterns.

6. What are the different stages of data mining?

  1. Exploration: This stage involves collecting and preparing the data, including data cleaning and transformation. Depending on the size of the data, different tools may be needed to analyze it. This stage helps to identify the relevant variables and understand their behavior.
  2. Model building and validation: This stage involves considering various models and choosing the best one based on predictive performance. The candidate models are applied to different data sets and their performance is compared. This stage is also called pattern identification. It is somewhat complex because it involves choosing the pattern that allows the best predictions.
  3. Deployment: The model selected in the previous stage is applied to new data sets to generate predictions or estimates of the expected outcome.

7. What are discrete and continuous data in the data mining world?
Discrete data takes one of a finite (or countable) set of distinct values, e.g. mobile numbers or gender. Continuous data can take any value within a range and has a natural ordering, e.g. age.

8. What is a model in the data mining world?
Models help the different data mining algorithms with decision making or pattern matching: a model captures the patterns that an algorithm has found in the data. The second stage of data mining involves considering various models and choosing the best one based on predictive performance.

9. How do data mining and data warehousing work together?
Data warehousing supports the analysis of business needs by storing data in a meaningful form. Data mining can then be used to forecast business needs, with the data warehouse acting as the source of data for this forecasting.

10. What is a decision tree algorithm?
A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs a decision. Each path from the root node to a leaf node combines the tests along it with AND, and different paths that lead to the same decision are alternatives combined with OR. The tree is constructed using the regularities in the data. The decision tree algorithm is not affected by Automatic Data Preparation.
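
As a minimal sketch of the decision node / leaf node structure, here is a hand-built tree in Python. The `Leaf` and `Decision` classes, the attribute names and the loan-style rules are hypothetical, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Leaf node: holds the final decision."""
    decision: str


@dataclass
class Decision:
    """Decision node: tests one attribute against a threshold."""
    attribute: str
    threshold: float
    if_below: Union["Decision", Leaf]         # branch taken when value < threshold
    if_at_or_above: Union["Decision", Leaf]   # branch taken otherwise


def classify(node, record):
    """Walk from the root to a leaf, following one branch per decision node."""
    while isinstance(node, Decision):
        if record[node.attribute] < node.threshold:
            node = node.if_below
        else:
            node = node.if_at_or_above
    return node.decision


# Hypothetical tree: approve when income is high AND debt is low.
tree = Decision(
    "income", 50_000,
    Leaf("reject"),
    Decision("debt", 10_000, Leaf("approve"), Leaf("reject")),
)

result = classify(tree, {"income": 72_000, "debt": 4_000})
```

The single path that ends in "approve" is an AND of two tests, while the two paths that end in "reject" are alternative (OR) routes to the same decision.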

11. What is the Naive Bayes algorithm?
The Naive Bayes algorithm is used to generate mining models that help to identify relationships between the input columns and the predictable columns. It can be used in the initial stage of exploration. The algorithm calculates the probability of every state of each input column, given each possible state of the predictable column. Once the model is built, the results can be used for exploration and for making predictions.
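
The probability calculation can be sketched for categorical columns as follows. This is a simplified illustration, not the implementation of any product; the `train` and `predict` helpers, the add-one smoothing and the weather-style rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and per-class value frequencies for each input column."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)   # (column, class) -> Counter of values
    values_seen = defaultdict(set)        # column -> distinct values observed
    for row in rows:
        for column, value in row.items():
            if column == target:
                continue
            value_counts[(column, row[target])][value] += 1
            values_seen[column].add(value)
    return class_counts, value_counts, values_seen


def predict(model, inputs):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total   # prior P(class)
        for column, value in inputs.items():
            count = value_counts[(column, cls)][value]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score *= (count + 1) / (n_cls + len(values_seen[column]))
        scores[cls] = score
    return max(scores, key=scores.get)


# Hypothetical training rows; "play" is the predictable column.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
model = train(rows, "play")
prediction = predict(model, {"outlook": "overcast", "windy": "no"})
```

The "naive" part is the multiplication inside `predict`: each input column is treated as independent of the others once the class is known.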

12. Explain the clustering algorithm.
The clustering algorithm groups data with similar characteristics into sets called clusters. These clusters help in making faster decisions and exploring data. The algorithm first identifies relationships in a dataset and then generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm repeatedly redefines the groupings to create clusters that better represent the data.
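
One common way to realize this iterative regrouping is k-means, sketched below in plain Python. The text does not name a specific method, so k-means is an assumption here, as are the `kmeans` helper, the sample points and the starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        if new_centroids == centroids:      # converged: groupings stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Each pass through the loop is one redefinition of the groupings; the loop stops early once a pass leaves the centroids unchanged.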

13. What is the time series algorithm in data mining?
A time series algorithm can be used to predict continuous values of data. Once the algorithm has been trained to predict one series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added, and it automatically becomes part of the trend analysis.
E.g. the performance of one employee can influence, or be used to forecast, the profit.
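
To make "predicting a trend from the original dataset" concrete, here is a deliberately simple stand-in: a least-squares straight line fitted to past values and extended forward. Real time series algorithms are more sophisticated; the `fit_trend` and `forecast` helpers and the sales figures are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    variance = sum((t - mean_t) ** 2 for t in range(n))
    slope = covariance / variance
    return slope, mean_v - slope * mean_t


def forecast(values, steps):
    """Extend the fitted trend `steps` periods past the end of the series."""
    slope, intercept = fit_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(steps)]


monthly_sales = [100.0, 110.0, 120.0, 130.0]   # hypothetical past data
next_two = forecast(monthly_sales, steps=2)
```

Appending a new observation to `monthly_sales` and calling `forecast` again refits the line, which mirrors the point that new data becomes part of the trend analysis.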

14. Explain the association algorithm in data mining.
The association algorithm is used for recommendation engines based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter specifies the minimum number of cases in which associated items must appear together before they are grouped into an itemset.
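
The role of a minimum support threshold can be sketched with a brute-force count of itemsets. This is not the algorithm used by any particular product (practical implementations prune the search, for example in the Apriori style); the `frequent_itemsets` helper, its `max_size` limit and the baskets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(cases, minimum_support, max_size=2):
    """Count itemsets of up to `max_size` items and keep those that appear
    in at least `minimum_support` cases."""
    counts = Counter()
    for items in cases:
        unique = sorted(set(items))   # ignore duplicates within one case
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= minimum_support}


# Hypothetical cases: each list is one customer's basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
frequent = frequent_itemsets(baskets, minimum_support=2)
```

With a threshold of 2, the pair ("bread", "milk") is kept because it occurs in two baskets, while ("butter", "milk") is dropped because it occurs in only one.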

15. What is the sequence clustering algorithm?
The sequence clustering algorithm groups similar or related paths, i.e. sequences of data containing events. The data represents a series of events or transitions between states, such as a series of web clicks. The algorithm examines all transition probabilities and measures the differences, or distances, between all the possible sequences in the data set. This helps it determine which sequences are the best to use as input for clustering.
E.g. a sequence clustering algorithm may help to find the path for storing products of a “similar” nature in a retail warehouse.
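
The transition probabilities mentioned above can be estimated by counting which state follows which. The sketch below covers only that first step, not the distance measurement or the clustering itself; the `transition_probabilities` helper and the click paths are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical web click paths.
clicks = [
    ["home", "search", "product"],
    ["home", "product"],
    ["home", "search", "search", "product"],
]
probabilities = transition_probabilities(clicks)
```

Sequences whose transitions look alike under these probabilities are the ones a clustering step would tend to place in the same group.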


16. Explain the concepts and capabilities of data mining.
Data mining is used to examine or explore data using queries, which can be run against the data warehouse. Exploring the data helps with reporting, planning strategies, finding meaningful patterns, etc. It is most commonly used to transform a large amount of data into a meaningful form. Data here can be facts, numbers or any real-time information such as sales figures, costs or metadata. The information is the set of patterns and relationships found in that data.

17. Explain how to work with the data mining algorithms included in SQL Server data mining.
SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow users to discover patterns and relationships in the data and that support enhanced analysis. The add-in called Data Mining Client for Excel is used to prepare data and then to build, evaluate and manage models and predict results.

18. Explain how to use DMX, the data mining query language.
DMX (Data Mining Extensions) is based on the syntax of SQL and on relational concepts, and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Data manipulation statements are used to manage the existing models and structures.

19. Explain how to mine an OLAP cube.
A data mining extension can be used to slice the data in the source cube in the order discovered by data mining. When a cube is mined, the case table is a dimension.

20. What are the different ways of moving data/databases between servers and databases in SQL Server?
There are several ways of doing this. One can use any of the following options:

  • Detaching/attaching databases,
  • Replication,
  • DTS,
  • BCP,
  • log shipping,
  • creating INSERT scripts to generate data.

21. What are OLAP and OLTP?
An IT system can be divided into analytical processing and transactional processing.

  • OLTP (online transaction processing) – characterized by a large number of short online transactions. The emphasis is on fast query processing and on maintaining data integrity in a multi-access environment.
  • OLAP (online analytical processing) – characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. Response time is the effectiveness measure, and OLAP is widely used with data mining techniques.
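
The contrast between the two workloads can be illustrated with Python's built-in `sqlite3` module. SQLite is neither a production OLTP system nor an OLAP engine, so this is only a sketch of the two query shapes; the `orders` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many short transactions, each touching a single row.
for region, amount in [("north", 120.0), ("south", 80.0), ("north", 50.0)]:
    with conn:   # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP-style work: one read-only query that aggregates over many rows.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
conn.close()
```

The inserts stand for the transactional side, where integrity of each small write matters, and the `GROUP BY` query stands for the analytical side, where one query summarizes the whole table.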

22. Explain the clustering algorithm.
A cluster is a collection of objects that are similar to one another and dissimilar to the objects in other clusters.
Clustering techniques can work in the following ways:

  • Exclusive: A member belongs to only one cluster.
  • Overlapping: A member can belong to more than one cluster.
  • Probabilistic: A member can belong to every cluster with a certain probability.
  • Hierarchical: Members are divided into hierarchies, which are sub-divided into clusters at a lower level.

23. Explain neural networks in detail.
Neural networks were introduced to mimic the way the brain works. When a human sees an object, for instance an animal, many inputs are sent to the brain: it has four legs, big horns, a long tail, etc. From these inputs the brain concludes that it is an animal. Since childhood the brain has been trained to understand such inputs and to produce an output based on them. This happens because billions of interconnected neurons in the brain work together to decide the output. An artificial neural network imitates this with layers of interconnected nodes whose connection weights are adjusted during training.

24. What is backpropagation in neural networks?
Backpropagation is a common method of teaching artificial neural networks how to perform a given task. It is a supervised learning method and a generalization of the delta rule. It requires a teacher that knows, or can calculate, the desired output for any input in the training set. It is most useful for feed-forward networks (networks that have no feedback, or simply, no connections that loop). The term is an abbreviation of “backward propagation of errors”. Backpropagation requires that the activation function used by the artificial neurons (or “nodes”) be differentiable.
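
A minimal sketch of backpropagation for a tiny feed-forward network (two inputs, two hidden nodes, one output) is shown below. The sigmoid activation, the squared-error loss, the learning rate, the OR truth table used as training set and all function names are assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    """Differentiable activation function, as backpropagation requires."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_w, output_w):
    """Feed-forward pass; the last weight in each row is a bias weight."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs + [1.0])))
              for row in hidden_w]
    output = sigmoid(sum(w * v for w, v in zip(output_w, hidden + [1.0])))
    return hidden, output


def train_step(inputs, target, hidden_w, output_w, rate):
    """One backpropagation update for a single training example."""
    hidden, output = forward(inputs, hidden_w, output_w)
    # Output error term: (output - target) times the sigmoid derivative.
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error backwards through the (not yet updated) output weights.
    delta_hidden = [delta_out * output_w[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j, value in enumerate(hidden + [1.0]):
        output_w[j] -= rate * delta_out * value
    for j, row in enumerate(hidden_w):
        for i, value in enumerate(inputs + [1.0]):
            row[i] -= rate * delta_hidden[j] * value


def mean_squared_error(data, hidden_w, output_w):
    return sum((forward(x, hidden_w, output_w)[1] - t) ** 2 for x, t in data) / len(data)


random.seed(0)
# The "teacher": desired outputs for every input (logical OR).
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
hidden_w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
output_w = [random.uniform(-1, 1) for _ in range(3)]

error_before = mean_squared_error(or_data, hidden_w, output_w)
for _ in range(2000):
    for x, t in or_data:
        train_step(x, t, hidden_w, output_w, rate=0.5)
error_after = mean_squared_error(or_data, hidden_w, output_w)
```

Comparing `error_before` with `error_after` shows the effect of training: the error on the training set should shrink as the weights are repeatedly nudged against the gradient.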

25. What is the time series algorithm in data mining?
The Microsoft Time Series algorithm allows you to analyze and forecast any time-based data, such as sales or inventory. The data should be continuous, and you need some past data from which the algorithm can predict values.