Title: | Feature Attributions for ClusTering |
---|---|
Description: | We present 'FACT' (Feature Attributions for ClusTering), a framework for unsupervised interpretation methods that can be used with an arbitrary clustering algorithm. The package is capable of re-assigning instances to clusters (algorithm-agnostic), preserves the integrity of the data, and does not introduce additional models. 'FACT' is inspired by the principles of model-agnostic interpretation in supervised learning. Therefore, some of the methods presented are based on 'iml', an R package for Interpretable Machine Learning by Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl (2018) <doi:10.21105/joss.00786>. |
Authors: | Henri Funk [aut, cre], Christian Scholbeck [aut, ctb], Giuseppe Casalicchio [aut, ctb] |
Maintainer: | Henri Funk <[email protected]> |
License: | LGPL-3 |
Version: | 0.1.1 |
Built: | 2024-11-20 04:21:52 UTC |
Source: | https://github.com/henrifnk/fact |
A ClustPredictor object holds any unsupervised clustering algorithm and the data to be used for analyzing the model. The interpretation methods in the FACT package need the clustering algorithm to be wrapped in a ClustPredictor object.
A ClustPredictor object is a container for the unsupervised prediction model and the data. This ensures that the clustering algorithm can be analyzed in a robust way. The class inherits from the iml::Predictor object and extends it with unsupervised methods.
iml::Predictor
-> ClustPredictor
type
character(1)
Either partition for cluster assignments or prob for soft labels. Can be chosen by the user when initializing the object. If NULL, it is inferred from the dimensions of y.
cnames
character
NULL if hard labeling is used. If soft labels are used, the column names of y are transferred.
new()
Create a ClustPredictor object
ClustPredictor$new( model = NULL, data = NULL, predict.function = NULL, y = NULL, batch.size = 1000, type = NULL )
model
any
The trained clustering algorithm. Models from mlr3cluster are recommended. For other clustering algorithms, a predict function needs to be specified.
data
data.frame
The data to be used for analyzing the prediction model. Allowed column
classes are: numeric, factor, integer, ordered and character
predict.function
function
The function used to assign newdata. Only needed if model is not a model from mlr3cluster. The first argument of predict.fun has to be the model, the second the newdata:
function(model, newdata)
y
any
An integer vector representing the assigned clusters, or a data.frame representing the soft labels per cluster, assigned in columns.
batch.size
numeric(1)
The maximum number of rows to be input to the model for prediction at once.
Currently only respected for SMART.
type
character(1)
This argument is passed to the prediction function of the model. For soft label predictions, use type="prob". For hard label predictions, use type="partition". Consult the documentation or definition of the clustering algorithm you use to find out which type options you have.
clone()
The objects of this class are cloneable with this method.
ClustPredictor$clone(deep = FALSE)
deep
Whether to make a deep clone.
require(factoextra)
require(FuzzyDBScan)
multishapes <- as.data.frame(multishapes[, 1:2])
eps = c(0, 0.2)
pts = c(3, 15)
res <- FuzzyDBScan$new(multishapes, eps, pts)
res$plot("x", "y")
# create hard label predictor
predict_part = function(model, newdata) model$predict(new_data = newdata, cmatrix = FALSE)$cluster
ClustPredictor$new(res, as.data.frame(multishapes), y = res$clusters,
                   predict.function = predict_part, type = "partition")
# create soft label predictor
predict_prob = function(model, newdata) model$predict(new_data = newdata)
ClustPredictor$new(res, as.data.frame(multishapes), y = res$results,
                   predict.function = predict_prob, type = "prob")
Create the algorithm's prediction function.
create_predict_fun(model, task, predict.fun = NULL, type = NULL)

## S3 method for class 'Learner'
create_predict_fun(model, task, predict.fun = NULL, type = NULL)
model | any |
task | |
predict.fun | function(model, newdata). To be extended for more methods. |
type | |
A unified cluster assignment function for either hard or soft labels.
create_predict_fun(Learner): Create a predict function for algorithms from mlr3cluster.
Calculation of a binary similarity metric based on the confusion matrix.
evaluate_class(actual, predicted, metric = "f1")
calculate_confusion(actual, predicted)
actual | |
predicted | |
metric | |
A binary score for each of the clusters and the number of instances.
calculate_confusion()
: Calculate confusion matrix
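To illustrate the kind of binary, cluster-versus-rest score such a confusion-based evaluation yields, here is a minimal base-R sketch of an F1 computation. The helper f1_binary and the toy label vectors are illustrative assumptions, not the package's internals:

```r
# Toy hard label assignments before and after some intervention (e.g., a permutation).
actual    <- c(1, 1, 1, 2, 2, 3, 3, 3)
predicted <- c(1, 1, 2, 2, 2, 3, 3, 1)

# Binary F1 for one cluster c versus the remaining clusters
# (illustrative helper, not part of the FACT package).
f1_binary <- function(actual, predicted, c) {
  tp <- sum(actual == c & predicted == c)  # kept in cluster c
  fp <- sum(actual != c & predicted == c)  # moved into cluster c
  fn <- sum(actual == c & predicted != c)  # moved out of cluster c
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# One binary score per cluster, as in cluster-specific evaluation.
sapply(sort(unique(actual)), function(c) f1_binary(actual, predicted, c))
```

Such per-cluster binary scores are what the avg argument of SMART later summarizes via micro or macro averaging.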
IDEA with a soft label predictor (sIDEA) tracks changes in the soft labels of being assigned to each existing cluster throughout a (multidimensional) feature space. IDEA with a hard label predictor (hIDEA) tracks changes in the cluster assignment throughout a (multidimensional) feature space.
IDEA for soft labeling algorithms (sIDEA) indicates the soft label with which an observation with replaced values is assigned to the k-th cluster. IDEA for hard labeling algorithms (hIDEA) indicates the cluster assignment of an observation with replaced values.
The global sIDEA aggregates the local curves over the corresponding data set X: the c-th vector element is the average of the c-th vector elements of the local sIDEA functions. The global hIDEA corresponds to the vector whose c-th element is the fraction of hard label reassignments to the c-th cluster.
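The aggregation described above can be sketched in base R with made-up numbers (toy values at a single grid point, not FACT output): the global sIDEA element for cluster c is the element-wise average over observations, and the global hIDEA element is a fraction of assignments.

```r
# Toy local sIDEA values at one grid point: rows = observations,
# columns = soft labels for clusters c = 1..3 (numbers are made up).
local_sidea <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.5, 0.4, 0.1),
  c(0.9, 0.0, 0.1)
)

# Global sIDEA at this grid point: the c-th element is the average of the
# c-th elements of the local curves.
global_sidea <- colMeans(local_sidea)
global_sidea  # -> 0.7 0.2 0.1

# For a hard label predictor, the global hIDEA element for cluster c is the
# fraction of (re)assignments to c.
hard_assignments <- c(1, 1, 2)
global_hidea <- table(factor(hard_assignments, levels = 1:3)) / length(hard_assignments)
```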
predictor
ClustPredictor
The object (created with ClustPredictor$new()
) holding
the cluster algorithm and the data.
feature
(character or list
)
Features / feature sets for which effect curves are calculated.
method
character(1)
The IDEA
method to be used.
mg
DataGenerator
A MarginalGenerator
object to sample and generate
the pseudo instances.
results
data.table
The IDEA
results.
noise.out
any
Indicator for the noise variable.
type
function
Detects the type of the predictor.
new()
Create an IDEA object.
IDEA$new(predictor, feature, method = "g+l", grid.size = 20L, noise.out = NULL)
predictor
ClustPredictor
The object (created with ClustPredictor$new()
) holding
the cluster algorithm and the data.
feature
(character or list
)
The features for which effect curves are calculated. The default value of NULL implies all features. Use a named list of character vectors to define groups of features for which joint effects will be calculated.
method
character(1)
The IDEA method to be used. Possible choices for the method are:
"g+l" (default): store global and local IDEA results
"local": store only local IDEA results
"global": store only global IDEA results
"init_local": store only local IDEA results and an additional reference for each observation's initially assigned cluster.
"init_g+l": store global and local IDEA results and an additional reference for each observation's initially assigned cluster.
grid.size
(numeric(1) or NULL)
Size of the grid used to replace values. If a grid size is given, an equidistant grid is created. If NULL, values are calculated at all present combinations of feature values.
noise.out
any
Indicator for the noise variable. If not NULL, noise will
be excluded from the effect estimation.
(data.frame)
Values for the effect curves:
One row per grid point per instance for each local IDEA estimation. If method includes global estimation, one additional row per grid point.
plot()
Plot an IDEA object.
IDEA$plot(c = NULL)
c
Indicator for the cluster to plot. If NULL, all clusters are plotted.
(ggplot)
A ggplot object that depends on the method
chosen.
plot_globals()
Plot the global sIDEA curves of all clusters.
IDEA$plot_globals(mass = NULL)
mass
A value between 0 and 1: the percentage of local IDEA curves used to plot a certainty interval.
(ggplot)
A ggplot object.
clone()
The objects of this class are cloneable with this method.
IDEA$clone(deep = FALSE)
deep
Whether to make a deep clone.
iml::FeatureEffects
# load data and packages
require(factoextra)
require(FuzzyDBScan)
multishapes = as.data.frame(multishapes[, 1:2])
# Set up and train FuzzyDBScan
eps = c(0, 0.2)
pts = c(3, 15)
res = FuzzyDBScan$new(multishapes, eps, pts)
res$plot("x", "y")
# create soft label predictor
predict_prob = function(model, newdata) model$predict(new_data = newdata)
predictor = ClustPredictor$new(res, as.data.frame(multishapes), y = res$results,
                               predict.function = predict_prob, type = "prob")
# Calculate `IDEA` global and local for feature "x"
idea_x = IDEA$new(predictor = predictor, feature = "x", grid.size = 5)
idea_x$plot_globals(0.5) # plot global effect of all clusters with 50 percent of local mass.
SMART - Scoring Metric After Permutation
SMART estimates the importance of a feature to the clustering algorithm by measuring changes in cluster assignments via scoring functions after permuting the selected feature. Cluster-specific SMART indicates the importance of specific clusters versus the remaining ones, measured by a binary scoring metric. Global SMART assigns importance scores across all clusters, measured by a multi-class scoring metric. Currently, SMART can only be used for hard label predictors.
Let C denote the multi-cluster confusion matrix and C_c the binary confusion matrix for cluster c versus the remaining clusters. SMART for feature set S corresponds to:
SMART(X, tilde(X)_S) = h(m(C_1), ..., m(C_k))
where h averages a vector of binary scores, e.g., via micro or macro averaging.
In order to reduce variance in the estimate from shuffling the data, one can shuffle t times and evaluate the distribution of scores. Let tilde(X)_S^(t) denote the t-th shuffling iteration for feature set S. The SMART point estimate is given by:
SMART(X, S) = psi(SMART(X, tilde(X)_S^(1)), ..., SMART(X, tilde(X)_S^(t)))
where psi extracts a sample statistic such as the mean, median, or a quantile.
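The shuffling scheme can be sketched in base R (a toy illustration under stated assumptions, not the package's implementation: stats::kmeans stands in for the clusterer, a hand-written nearest-centroid rule assign_nearest for the predict function, and plain accuracy for the scoring metric):

```r
set.seed(1)
# Toy data with two clusters separated along feature "x".
X <- data.frame(x = c(rnorm(30, 0), rnorm(30, 5)), y = rnorm(60))
km <- kmeans(X, centers = 2)

# Assign rows to the nearest centroid (stand-in for a generic predict function).
assign_nearest <- function(centers, newdata) {
  d <- as.matrix(dist(rbind(centers, as.matrix(newdata))))
  d <- d[-(1:nrow(centers)), 1:nrow(centers), drop = FALSE]
  max.col(-d)  # index of the closest centroid per row
}

# One SMART-style repetition for feature set S = "x":
# permute the feature, re-assign, and score agreement with the
# original assignments (here: accuracy).
score_once <- function() {
  X_perm <- X
  X_perm$x <- sample(X_perm$x)
  mean(assign_nearest(km$centers, X_perm) == km$cluster)
}

# Shuffle t = 20 times and extract a sample statistic (median) as point estimate.
scores <- replicate(20, score_once())
median(scores)  # well below 1, so feature "x" matters to the clustering
```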
avg
(character(1) or NULL)
Either NULL, "micro" or "macro". NULL calculates cluster-specific (binary) metrics. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance.
metric
character(1)
The binary similarity metric used.
predictor
ClustPredictor
The object (created with ClustPredictor$new()
) holding
the cluster algorithm and the data.
data.sample
data.frame
The data, including features and cluster soft/hard labels.
sampler
any
Sampler from the predictor
object.
features
(character or list
)
Features / feature sets for which importance scores are calculated.
n.repetitions
(numeric(1)
)
How often is the shuffling of the feature repeated?
results
(data.table
)
A data.table containing the results from SMART
procedure.
new()
Create a SMART object
SMART$new( predictor, features = NULL, metric = "f1", avg = NULL, n.repetitions = 5 )
predictor
ClustPredictor
The object (created with ClustPredictor$new()
) holding
the cluster algorithm and the data.
features
(character or list
)
For which features do you want importance scores calculated. The default
value of NULL
implies all features. Use a named list of character vectors
to define groups of features for which joint importance will be calculated.
metric
character(1)
The binary similarity metric used. Defaults to "f1", the F1 Score. Other possible binary scores are "precision", "recall", "jaccard", "folkes_mallows" and "accuracy".
avg
(character(1)
or NULL
)
Either NULL, "micro" or "macro". Defaults to NULL, which calculates cluster-specific (binary) metrics. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance. For unbalanced clusters, "macro" is recommended.
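The difference between the two averaging schemes can be sketched with hypothetical per-cluster confusion counts (made-up numbers, not FACT output): macro averages the per-cluster F1 scores with equal weight per cluster, while micro pools the counts first, so each instance weighs equally and a large, well-recovered cluster dominates.

```r
# Hypothetical binary confusion counts (tp, fp, fn) for 3 clusters;
# cluster 1 is large and well recovered, clusters 2 and 3 are small and noisy.
counts <- data.frame(
  tp = c(90, 5, 5),
  fp = c(10, 2, 3),
  fn = c(5, 10, 8)
)
f1 <- function(tp, fp, fn) 2 * tp / (2 * tp + fp + fn)

# Macro: average the per-cluster F1 scores, each cluster weighted equally.
macro_f1 <- mean(with(counts, f1(tp, fp, fn)))

# Micro: pool the counts first, so each instance is weighted equally.
micro_f1 <- with(counts, f1(sum(tp), sum(fp), sum(fn)))

c(macro = macro_f1, micro = micro_f1)  # micro > macro here: the big cluster dominates
```

This is why "macro" is recommended for unbalanced clusters: it keeps the small clusters from being drowned out.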
n.repetitions
(numeric(1)
)
How often should the shuffling of the feature be repeated?
The higher the number of repetitions the more stable and
accurate the results become.
(data.frame)
data.frame with the results of the feature importance computation. One row per feature with the following columns:
For global scores:
importance.05 (5% quantile of importance values from the repetitions)
importance (median importance)
importance.95 (95% quantile)
permutation.error (median error over all repetitions)
For cluster-specific scores, each column reports the score for a different cluster.
print()
Print a SMART
object
SMART$print()
character
Information about the predictor, data, metric, and avg, as well as the head of the results.
plot()
plots the similarity score results of a SMART
object.
SMART$plot(log = FALSE, single_cl = NULL)
log
logical(1)
Indicator whether results should be logged. This can be
useful to distinguish the importance if similarity scores
are all close to 1.
single_cl
character(1)
Only used for cluster-specific scores (avg = NULL
).
Should match one of the cluster names.
In this case, importance scores for a single cluster are
plotted.
The plot shows the similarity per feature.
For global scores:
When n.repetitions
in SMART$new
was larger than 1, then we get
multiple similarity estimates per feature. The similarities are aggregated and
the plot shows the median similarity per feature (as dots) and also the
90%-quantile, which helps to understand how much variance the computation has
per feature.
For cluster-specific scores:
Stacks the similarity estimates of all clusters per feature. This can be used to obtain a global estimate as the sum of cluster-wise similarities.
ggplot2 plot object
clone()
The objects of this class are cloneable with this method.
SMART$clone(deep = FALSE)
deep
Whether to make a deep clone.
# load data and packages
require(factoextra)
require(FuzzyDBScan)
multishapes = as.data.frame(multishapes[, 1:2])
# Set up and train FuzzyDBScan
eps = c(0, 0.2)
pts = c(3, 15)
res = FuzzyDBScan$new(multishapes, eps, pts)
res$plot("x", "y")
# create hard label predictor
predict_part = function(model, newdata) model$predict(new_data = newdata, cmatrix = FALSE)$cluster
predictor = ClustPredictor$new(res, as.data.frame(multishapes), y = res$clusters,
                               predict.function = predict_part, type = "partition")
# Run SMART globally
macro_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1", avg = "macro")
macro_f1 # print global SMART
macro_f1$plot(log = TRUE) # plot global SMART
# Run cluster specific SMART
classwise_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1")
classwise_f1 # print cluster-specific SMART
classwise_f1$plot(log = TRUE) # plot cluster-specific SMART