Data Mining Process

Project/Business Understanding:

Identify the potential benefits, risks and effort of a successful project.

Data Understanding: Sufficient

relevant data

Visual assessment of basic

relationships and properties

Data quality (missing values)

Abnormal cases (outliers)

Data Preparation: Selection,

correction and modification of data

Modeling: Extract knowledge out of

data in the form of a model

Predictive – Explanatory

Evaluation

Deployment

Data Mining Cycle

Data Understanding

Main Goal

Gain general insights about the data that will potentially be

helpful for the further steps in the data analysis process

Not driven exclusively by goals and methods of later steps

Approach data from neutral viewpoint

Never trust data before carrying out simple plausibility

checks

At the end of Data Understanding we know much better

whether the assumptions we made during the Project

Understanding phase concerning: representativeness,

informativeness, and data quality are justified

Visualisation: Overview of basic characteristics of data and

check plausibility

Simple statistics

Outliers, missing values, data quality

Data Visualisation

Bar chart: Frequency distribution for categorical attribute

Histogram: Frequency distribution for numerical attribute

Divide values into bins and show a bar plot of the number of

objects in each bin

Height of each bar indicates the number of objects in bin

Shape of histogram depends on number of bins
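
A minimal sketch of the binning step, to illustrate how the bin count changes the shape of a histogram. The equal-width bins, the helper name histogram and the two-cluster sample are assumptions made for this example only.

```python
import random

def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than in a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

rng = random.Random(0)
# Hypothetical sample: two clusters of values, around 0 and around 5.
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]

for n_bins in (2, 10, 40):
    # Each count is the height of one bar.
    print(n_bins, histogram(data, n_bins))
```

With very few bins the two clusters merge into two bars; with many bins the bar heights become noisy.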

Boxplots

Very compact method to visualise distribution of one

attribute

Many boxplots can fit in single plot: Useful for comparing

distributions

Scatterplots

Relationship between two attributes (linear/ non-linear)

Axes represent two considered attributes

Each instance in the dataset is represented by a point

Correlation between attributes

Outliers

With class label info: Separability of classes

Correlation Analysis

Scatterplots can give us an idea about correlations

between pairs of variables

Pearson’s correlation coefficient: Measure of linear

association between 2 numerical attributes. Always

between -1 and 1

Even if a functional dependency exists between two

attributes and the function is monotone, if it is non-linear

then Pearson’s correlation coefficient can be far away from

-1 and 1

Rank correlation coefficients overcome this by relying on

the ordering of the values of the attributes: Spearman’s rho
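
The contrast between Pearson's coefficient and a rank correlation can be sketched on a monotone but non-linear dependency. The data (y = exp(x)) and the helper names are assumptions for illustration, and the rank helper assumes there are no ties.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 to n (assumes no ties, which holds for the data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [i / 2 for i in range(1, 21)]   # 0.5, 1.0, ..., 10.0
ys = [math.exp(x) for x in xs]       # monotone but strongly non-linear

print("Pearson: ", round(pearson(xs, ys), 3))
print("Spearman:", round(spearman(xs, ys), 3))
```

Because the ordering of ys matches the ordering of xs exactly, the rank-based coefficient reaches 1 while Pearson's coefficient stays well below it.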

Outliers

Outlier

A value or a data object that is far away or very different from

most or all of the other data

Intuitive but imprecise definition

It might be worthwhile to exclude outliers from analysis

Some methods are more robust to outliers than others

Categorical Attribute: value that occurs with very low

frequency

Numerical Attribute: Detection much more difficult.

Boxplot, Scatterplot

For multidimensional data much more complicated

approaches need to be used
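
A small sketch of both single-attribute checks. The 1.5 x IQR rule (the usual boxplot whisker convention) and the 5% frequency threshold are assumptions chosen for illustration; the notes do not prescribe specific cut-offs.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Numerical attribute: flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def rare_categories(values, min_share=0.05):
    """Categorical attribute: flag values that occur with very low frequency."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sorted(c for c, k in counts.items() if k / len(values) < min_share)

# Hypothetical data: one implausible height and one misspelt category.
heights = [160, 162, 165, 167, 168, 170, 171, 173, 175, 250]
colours = ["red"] * 50 + ["blue"] * 48 + ["rde"] * 2

print(iqr_outliers(heights))
print(rare_categories(colours))
```

Flagged values are candidates for inspection, not automatic deletions; whether to exclude them remains a modelling decision.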

Missing Values

Missing values: One of the most important problems in real

applications

Not one best way of handling missing values

Missing Completely at Random: No special

circumstances or special values of the variable in question

lead to higher or lower chances for values to be missing

Missing at Random: Probability of a missing value

depends on some other variable(s) Y but conditionally

on Y it is independent of the value of X

Nonignorable missing: Occurrence of missing values

directly depends on the true value of the attribute

Distinguish Between Types of Missing Values

Distinction between MCAR and MAR: In case of MAR

other attributes can be used to predict whether value is

missing

Turn considered attribute into binary variable: 1 if value

exists, 0 if it is missing

Build a classifier to predict binary variable using as inputs

other variables

Determine error rate

MCAR: Error rate is approx. equal to proportion of missing

values

MAR: Error rate is significantly lower (it could also be

non-ignorable missing)

In general not possible to distinguish non-ignorable

missing from the other two cases using only available data
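
The procedure above can be sketched on synthetic data. Everything here is an assumption for illustration: a single other variable Y, a one-threshold classifier (stump) as the classifier, a 20% missing rate, and a held-out half of the data for the error rate.

```python
import random

def stump_error(ys, exists, t, below):
    """Error rate of the rule: predict `below` when y <= t, else 1 - below."""
    wrong = sum((below if y <= t else 1 - below) != e for y, e in zip(ys, exists))
    return wrong / len(ys)

def fit_stump(ys, exists):
    """Choose the threshold and orientation with the lowest training error."""
    grid = [i / 20 for i in range(21)]
    return min(((t, b) for t in grid for b in (0, 1)),
               key=lambda tb: stump_error(ys, exists, tb[0], tb[1]))

def missingness_check(ys, exists):
    """Fit on the first half, then compare held-out error with the missing share."""
    half = len(ys) // 2
    t, below = fit_stump(ys[:half], exists[:half])
    error = stump_error(ys[half:], exists[half:], t, below)
    missing_share = 1 - sum(exists[half:]) / len(exists[half:])
    return error, missing_share

rng = random.Random(1)
ys = [rng.random() for _ in range(4000)]
# Binary variable: 1 if the value of X exists, 0 if it is missing.
# MCAR: 20% missing regardless of Y. MAR: missingness depends on Y only.
mcar = [0 if rng.random() < 0.2 else 1 for _ in ys]
mar = [0 if rng.random() < (0.8 if y > 0.8 else 0.05) else 1 for y in ys]

for name, exists in (("MCAR", mcar), ("MAR", mar)):
    error, share = missingness_check(ys, exists)
    print(f"{name}: error rate {error:.3f}, missing share {share:.3f}")
```

In the MCAR case the classifier cannot beat always predicting "exists", so its error stays near the missing share; in the MAR case Y is informative and the error drops clearly below it.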

Treating Missing Values

Explicit value: Replace missing entries with a new attribute value MISSING (nominal attributes)

If the fact that the value is missing carries information

about the value itself (non-ignorable missing) introduction

of new value can help because it can express an intention

not captured by other attributes

Better approach: Introduce new binary variable indicating that value was missing in original dataset and then substitute missing value

If neither other attributes nor imputed value help but the fact that the value was missing is important, binary variable captures this

If no such missing value pattern is present the imputed

value can be used without introducing MISSING value
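
A minimal sketch of the indicator-plus-imputation approach for a numerical attribute. Using None to mark missing entries and the median as the substituted value are assumptions; the notes do not fix an imputation method.

```python
import statistics

def impute_with_indicator(values):
    """Return (imputed_values, was_missing) where None marks a missing entry."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)  # assumption: median imputation
    was_missing = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, was_missing

# Hypothetical attribute with two missing entries.
income = [30, None, 45, 52, None, 38]
imputed, was_missing = impute_with_indicator(income)
print(imputed)
print(was_missing)
```

Both columns are then passed to the model, so the fact that a value was missing stays available even after substitution.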

Relevance of Attribute

More realistic problem

Information available: X = H: P(G) high, X = L: P(G) low

General Decision Rule

Given the risk forecast of an applicant, X ∈ {H, L}:

o(G | X = x) = P(G | X = x) / P(B | X = x) > ℓ / g
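
A tiny sketch of an odds-based decision rule of this kind: accept an applicant when the odds of G given the forecast exceed a threshold. The probabilities and the threshold value below are hypothetical and not taken from the notes.

```python
def odds(p_good):
    """Odds of G: P(G | X = x) / P(B | X = x), with P(B | X = x) = 1 - P(G | X = x)."""
    return p_good / (1 - p_good)

def accept(p_good, threshold):
    """Accept the applicant when the odds of G exceed the threshold."""
    return odds(p_good) > threshold

# Hypothetical values: P(G | X = H) = 0.95, P(G | X = L) = 0.60, threshold 4.
for forecast, p_good in (("H", 0.95), ("L", 0.60)):
    print(forecast, round(odds(p_good), 2), accept(p_good, threshold=4))
```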

Sensitivity = TP / (TP + FN): Minimise misclassification of Class 1 records (also called Recall)

Specificity = TN / (TN + FP): Minimise misclassification of Class 0 records
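
Both measures follow directly from the confusion-matrix counts. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the labels and predictions are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, TN, FP) for binary labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    return tp, fn, tn, fp

def sensitivity(actual, predicted):
    tp, fn, _, _ = confusion_counts(actual, predicted)
    return tp / (tp + fn)

def specificity(actual, predicted):
    _, _, tn, fp = confusion_counts(actual, predicted)
    return tn / (tn + fp)

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(sensitivity(actual, predicted), specificity(actual, predicted))
```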

ROC Curve

Critical points on ROC curve, given as (FPR, TPR)

(0,0): All records classified 0

(1,1): All records classified 1

(0,1): Ideal model

Random Classifier: Diagonal Line

Below diagonal line: Prediction is opposite of true class

Good classifier: As close as possible to upper left corner

Area Under ROC (AUC): Summarises ROC curve into a

single number
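
A sketch of how the ROC points and the AUC can be computed from classifier scores. Points are written as (FPR, TPR), the threshold sweeps from the highest score downwards, and the area uses the trapezoidal rule; the scores are hypothetical.

```python
def roc_points(actual, scores):
    """ROC points (FPR, TPR), from (0, 0) to (1, 1), one per distinct threshold."""
    pos = sum(actual)
    neg = len(actual) - pos
    pairs = sorted(zip(scores, actual), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties move together.
        if i == len(pairs) - 1 or pairs[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(actual, scores)
print(points)
print(round(auc(points), 3))
```

A classifier that ranks every Class 1 record above every Class 0 record passes through (0, 1) and has an AUC of 1; a random ranking gives roughly 0.5.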

Cost-Sensitive Learning

Cost of Misclassification

C(i,j): Cost of classifying a pattern from class i as class j

Cost Matrix: Rows give the actual class i, columns the predicted class j, and each entry is C(i,j)
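
A sketch of how a cost matrix can be used, assuming rows index the actual class i and columns the predicted class j. The cost values and the expected-cost decision rule are illustrative assumptions, not taken from the notes.

```python
def total_cost(actual, predicted, cost):
    """Sum C(i, j) over all records: i is the actual class, j the predicted class."""
    return sum(cost[a][p] for a, p in zip(actual, predicted))

def min_cost_class(probabilities, cost):
    """Pick the class j with the lowest expected cost sum_i P(i) * C(i, j)."""
    classes = range(len(cost))
    expected = [sum(probabilities[i] * cost[i][j] for i in classes) for j in classes]
    return min(classes, key=lambda j: expected[j])

# Hypothetical costs: missing a Class 1 record is ten times worse than a false alarm.
cost = [[0, 1],
        [10, 0]]

print(total_cost([0, 0, 1, 1], [0, 1, 0, 1], cost))
print(min_cost_class([0.8, 0.2], cost))
```

With these costs a record is assigned to class 1 even when class 0 is four times as likely, because the expected cost of predicting 0 (0.2 x 10 = 2) exceeds that of predicting 1 (0.8 x 1 = 0.8).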

Increasing variable x_j by 1:

Increases log(o(1|x_i)) by β_j

Increases o(1|x_i) by a factor of e^(β_j)

If x_j is binary then x_j = 1 increases o(1|x_i) by a factor of e^(β_j)
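
A quick numerical check of this interpretation, assuming the usual logistic regression form in which the log-odds are linear in the attributes. The coefficient values are hypothetical.

```python
import math

def odds_of_class_1(x, beta0, beta):
    """Odds o(1|x) = exp(beta0 + sum_j beta_j * x_j) under a logistic model."""
    return math.exp(beta0 + sum(b * v for b, v in zip(beta, x)))

beta0, beta = -1.0, [0.5, -0.25]   # hypothetical coefficients
x = [2.0, 1.0]
x_plus = [3.0, 1.0]                # first attribute increased by 1

ratio = odds_of_class_1(x_plus, beta0, beta) / odds_of_class_1(x, beta0, beta)
print(ratio, math.exp(beta[0]))    # the two numbers agree
```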

Synopsis Logistic Regression

Linear predictor:

Accommodates quantitative and qualitative variables

(dummy)

Enables transformations and combinations (interactions) while retaining interpretability. Logistic regression extends this idea to binomial data

Explanatory model:

Contribution of individual variables

Model comparison – Model selection

Confidence interval (not covered)

Assumes a linear relationship between attribute values and the log-odds of success

Non-linearities can be overcome using discretisation

Decision Trees

Decision Tree Approach

Ask series of questions about attributes to determine class

Build decision tree from top to bottom (from root to leaves)

Greedy selection of a test attribute

Compute an evaluation measure for all attributes

Select the attribute with the best evaluation

Greedy Strategy

Grows a decision tree by making a series of locally optimal

decisions about how to partition the data

Divide and conquer / recursive descent

Divide examples according to the values of the test attribute

Apply the procedure recursively to the subsets (Hunt’s

algorithm)
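
The greedy, recursive procedure can be sketched for categorical attributes as follows. The Gini impurity is assumed as the evaluation measure (the notes do not name one), and the records and attribute names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, labels, attribute):
    """Weighted impurity after dividing the records by one attribute's values."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

def best_attribute(rows, labels, attributes):
    """Greedy step: evaluate all attributes and select the best one."""
    return min(attributes, key=lambda a: split_impurity(rows, labels, a))

def build_tree(rows, labels, attributes):
    # Stop at pure nodes or when no attributes remain; predict the majority class
    # (ties are broken arbitrarily).
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)
    best = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != best]
    tree = {"attribute": best, "children": {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        # Divide and conquer: apply the same procedure to each subset.
        tree["children"][value] = build_tree([r for r, _ in subset],
                                             [l for _, l in subset], remaining)
    return tree

rows = [{"owner": "yes", "employed": "yes"}, {"owner": "no", "employed": "yes"},
        {"owner": "yes", "employed": "no"}, {"owner": "no", "employed": "no"}]
labels = ["good", "good", "bad", "bad"]
tree = build_tree(rows, labels, ["owner", "employed"])
print(tree)
```

Each call only looks for the locally best split, which is why the resulting tree is not guaranteed to be the globally optimal one.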

Characteristics of Decision Tree Induction

Non-parametric: No assumptions about the type of

probability distributions satisfied by the data

Finding optimal decision tree is computationally infeasible:

Greedy heuristic approaches

Decision tree induction algorithms construct trees quickly

even for very large train sets

Easy to interpret: Especially for small trees

Robust to presence of noise: Especially when methods to

avoid overfitting are employed

Redundant attributes do not adversely affect accuracy

If dataset contains many irrelevant attributes then some

could be accidentally chosen by tree-growing algorithm.

Feature selection

Characteristics of Decision Tree Induction

Data fragmentation: Number of records at leaf nodes can

become too small to make statistically significant decision

– Impose threshold on minimum number of records per

node

Subtree can be replicated many times within a decision

tree making the model more complex and harder to

interpret

Robust performance w.r.t. choice of impurity measure

Treatment of missing values

Small changes in train set can yield an entirely different tree, but predictive performance is robust

Performance adversely affected by too many interval

scaled variables (Discretisation)

Artificial Neural Networks (ANN)

ANNs inspired by attempts to model biological neural

systems

Brain consists of a large number of interconnected simple

processing units (neurons)

Learning in human brain takes place by changing the

strength of the synaptic connection between neurons

through repeated stimulation by the same impulse

Perceptron

Perceptron: Simple Model of a Neuron

Each input node is connected via a weighted link to the

summing junction

Weights emulate strength of synaptic connection between

neurons

Training adapts weights to reduce error

Can solve linearly separable problems
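
A minimal perceptron sketch, trained on the logical AND function as an example of a linearly separable problem. The step activation, the error-driven update rule, the learning rate and the epoch count are standard textbook choices assumed here, not details given in the notes.

```python
def predict(weights, bias, x):
    """Summing junction followed by a step function."""
    return 1 if bias + sum(w * v for w, v in zip(weights, x)) > 0 else 0

def train_perceptron(samples, labels, rate=1.0, epochs=20):
    """Adapt the weights whenever a training example is misclassified."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)   # -1, 0 or +1
            weights = [w + rate * error * v for w, v in zip(weights, x)]
            bias += rate * error
    return weights, bias

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]                               # logical AND
weights, bias = train_perceptron(samples, labels)
print(weights, bias, [predict(weights, bias, x) for x in samples])
```

Replacing the labels with XOR ([0, 1, 1, 0]) gives a problem that is not linearly separable, so no weight setting of a single perceptron classifies all four points correctly.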

Artificial Neural Networks

Number of simple processing

units (nodes)

Organised in Layers

Output layer: Returns prediction

Input layer: Receives inputs

Hidden layers: Layers between

input and output layers

Topology: 5 × 3 × 1

Multilayer Perceptrons: Only

Feed-forward connections

More complicated decision boundaries can be

approximated using more nodes and more layers
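
A forward pass through a feed-forward network with the 5 × 3 × 1 topology can be sketched as below. The sigmoid activation, the random initial weights and the input vector are assumptions for illustration; no training is shown.

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """One weight row and one bias per node of the layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def forward(x, layers):
    """Feed-forward only: each layer's output is the next layer's input."""
    for weights, biases in layers:
        x = [sigmoid(b + sum(w * v for w, v in zip(row, x)))
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)
topology = [5, 3, 1]   # input layer, one hidden layer, output layer
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(topology, topology[1:])]
output = forward([0.2, 0.4, 0.1, 0.9, 0.5], layers)
print(output)
```

Adding entries to topology adds hidden layers, which is how the network gains the flexibility to approximate more complicated decision boundaries.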

Design Issues in ANNs

Systems that combine automatic feature extraction with

classification process

By increasing the number of hidden nodes and the number of hidden layers, ANNs can become very flexible classifiers

Flexibility can easily result in overfitting

Selecting an appropriate topology is critical

No general rule for how to choose the number of hidden layers and the size of the hidden layers

Small neural networks might not be flexible enough to fit the data. Large neural networks tend to overfit

Cannot handle missing values

Black box models: Explaining what an ANN has learned is

not straightforward

Very sensitive to chosen feature vector: Variable selection

and preprocessing necessary

Ensemble Methods

Central Idea

Improve accuracy by combining predictions of multiple

classifiers

Conditions for performance improvement

1 Base classifiers (close to) independent

2 Base classifiers better than random guessing
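
Both conditions can be checked numerically. Assuming fully independent base classifiers that share the same error rate and an odd ensemble size, the majority vote is wrong only when more than half of the classifiers are wrong, which is a binomial tail probability. The ensemble size of 25 is a hypothetical choice.

```python
from math import comb

def ensemble_error(n_classifiers, base_error):
    """Probability that a majority of independent base classifiers is wrong."""
    majority = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k)
               * base_error ** k * (1 - base_error) ** (n_classifiers - k)
               for k in range(majority, n_classifiers + 1))

for base_error in (0.35, 0.5, 0.6):
    print(base_error, round(ensemble_error(25, base_error), 3))
```

The ensemble improves on its members only when the base error is below 0.5; at 0.5 it stays at 0.5, and above 0.5 combining makes matters worse.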

Constructing Ensemble Classifiers: Bagging

Bagging – Bootstrap Aggregating

Create many training sets

through Bootstrapping

(resampling with replacement)

Build classifier for each train set

Use majority vote to predict

Reduces variance of base classifiers

Unstable classifier: Sensitive to minor perturbations in

train data

Bagging reduces generalisation error of unstable classifiers

(Decision trees, Neural networks, k–nearest neighbours)

Can be detrimental for stable/robust classifiers because each bootstrap sample contains fewer distinct training records (about 63% on average)

Does not focus on particular instances of training data
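
A compact bagging sketch. The base classifier is assumed to be a one-threshold rule (decision stump) on a single numerical attribute, and the data set and number of rounds are hypothetical.

```python
import random

def predict(stump, x):
    """Stump rule: predict `above` when x > t, else the other class."""
    t, above = stump
    return above if x > t else 1 - above

def fit_stump(points):
    """Base classifier: threshold and orientation with the fewest training errors."""
    candidates = [(t, above) for t in sorted({x for x, _ in points}) for above in (0, 1)]
    return min(candidates, key=lambda s: sum(predict(s, x) != y for x, y in points))

def bagging(points, n_rounds, rng):
    """Build one classifier per bootstrap sample (same size, with replacement)."""
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_rounds)]

def vote(stumps, x):
    """Majority vote over the base classifiers."""
    return 1 if sum(predict(s, x) for s in stumps) > len(stumps) / 2 else 0

points = [(x, 1 if x > 5 else 0) for x in range(11)]
stumps = bagging(points, 25, random.Random(0))
print([vote(stumps, x) for x in range(11)])
```

Every record has the same chance of being drawn in every round, which is the sense in which bagging does not focus on particular training instances.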

Boosting

Example: Weights determine sampling distribution

Initially all weights are equal to 1/N

At each round i = 1, 2, ...

Draw bootstrap sample D_i based on weights

Base classifier built on D_i and used to classify all examples from original dataset D

Increase weights of misclassified examples

Misclassified examples more likely to be chosen in

subsequent rounds

Attention focused on difficult to classify examples
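
One boosting round as described above can be sketched as follows. The notes do not give the weight update rule, so multiplying the weights of misclassified examples by a fixed factor and renormalising is a simplifying assumption (it is not the AdaBoost update); the stump base classifier and the data are also hypothetical.

```python
import random

def predict(stump, x):
    """Stump rule: predict `above` when x > t, else the other class."""
    t, above = stump
    return above if x > t else 1 - above

def fit_stump(points):
    """Base classifier: threshold and orientation with the fewest training errors."""
    candidates = [(t, above) for t in sorted({x for x, _ in points}) for above in (0, 1)]
    return min(candidates, key=lambda s: sum(predict(s, x) != y for x, y in points))

def boosting_round(data, weights, rng, factor=2.0):
    # Draw the sample D_i from D with probabilities given by the current weights.
    sample = rng.choices(data, weights=weights, k=len(data))
    stump = fit_stump(sample)
    # Classify every example of D and raise the weights of the mistakes.
    raised = [w * factor if predict(stump, x) != y else w
              for (x, y), w in zip(data, weights)]
    total = sum(raised)
    return stump, [w / total for w in raised]

# No single threshold separates these labels, so some examples are always hard.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 1)]
weights = [1 / len(data)] * len(data)   # initially all weights equal 1/N
rng = random.Random(0)
for i in range(1, 4):
    stump, weights = boosting_round(data, weights, rng)
    print(i, stump, [round(w, 3) for w in weights])
```

Examples that keep being misclassified accumulate weight and are therefore drawn more often in later rounds.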