Main window
This window is the one you see when you start the graphical interface. The
window is divided into three parts. The top frame allows you to select
the input file in the Input filename textbox and specify the
transformation you want to perform on the values in the input file before
starting the clustering in the Transformation combo box. Note that
the transformation is selected automatically if the extension of your
input file is one of the known ones (see Supported input formats):
E-values obtained from FASTA files and from files ending in .blast will be
transformed using a non-linear similarity transformation, while the similarity
values in .sim files will be left intact.
Before the analysis, the similarity values are symmetrised. The symmetrisation
process takes every pair of elements A and B, compares the similarity value given
for A-B with the one given for B-A, and keeps either the higher or the lower of
the two. The symmetrisation method can be set using the Symmetrisation combo box.
The max(A, B) method is a safe choice in most cases.
Warning
There is a common catch associated with the min(A, B) symmetrisation
method. If you don't specify a similarity value for a pair of elements
in the input file, it is assumed to be zero. Hence, if there exists a pair
of elements A and B where a similarity value is specified for A-B but not
for B-A, the specified value is discarded: the missing one is treated as
zero, and zero is smaller than any positive weight.
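The effect of the two symmetrisation methods can be illustrated with a small Python sketch. This is not the code used by SCPS; the function name symmetrise and the dictionary-based input are assumptions made for the illustration. Missing pairs are treated as zero, as described above.

```python
def symmetrise(similarities, method="max"):
    """Symmetrise a dict {(a, b): value}; pairs that are not listed count as 0."""
    if method not in ("max", "min"):
        raise ValueError("method must be 'max' or 'min'")
    pick = max if method == "max" else min
    result = {}
    for a, b in similarities:
        forward = similarities.get((a, b), 0.0)
        backward = similarities.get((b, a), 0.0)
        value = pick(forward, backward)
        if value > 0:
            # Store every unordered pair once, under a sorted key.
            result[tuple(sorted((a, b)))] = value
    return result


# A-B is given in both directions, A-C only in one direction.
pairs = {("A", "B"): 0.8, ("B", "A"): 0.6, ("A", "C"): 0.5}
sym_max = symmetrise(pairs, "max")
sym_min = symmetrise(pairs, "min")
print(sym_max)
print(sym_min)
```

With the max method both pairs survive, while the min method drops the A-C pair because the missing C-A value counts as zero.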
The middle frame in the main window is used to select the algorithm you
wish to run and also tune its behaviour. The algorithm can be selected using
the Clustering algorithm combo box. The Number of clusters combo box
allows you to tune how the algorithm decides the number of clusters. The
following methods are available (but not all clustering algorithms support
all the methods, so some may be hidden depending on your algorithm selection):
- Automatic
- The algorithm will try to select the number of clusters automatically.
The exact details of the process differ for each algorithm. Please
refer to the Clustering algorithms section for more details.
- Exactly
- The algorithm will try to create exactly k clusters, where k is
a number given by you in a textbox that appears when you select
this option in the Number of clusters combo box. Note that sometimes
the requested cluster count cannot be satisfied; e.g., you cannot create
5 clusters when your original similarity dataset consists of more than 5
connected components. In this case, the algorithm will try to get as close
to the desired number of clusters as possible.
- At most
- This method is similar to Automatic, but it imposes an upper bound on
the number of clusters. It is highly advised to use this option when you
run the spectral clustering algorithm on large datasets (larger than a
thousand sequences or so), as it saves time and resources: the algorithm
will calculate only the top k eigenvalues and eigenvectors, which can be
done more efficiently for sparse input matrices.
- Manually
- This method is supported only for the spectral clustering algorithm:
it will calculate the top 100 eigenvalues and eigenvectors and lets you
select the number of clusters based on the eigenvalues and eigengaps.
Because only the top 100 eigenvalues are computed, this method
is suitable for datasets where you don't expect more than a hundred clusters.
See the section on the Cluster count selector window for more details.
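The limitation mentioned for the Exactly method can be sketched in Python as follows. The sketch counts the connected components of the similarity graph with a union-find structure and moves the requested cluster count to the closest achievable value. The function names, and the reading of "as close as possible" as a simple clamp, are assumptions made for this illustration; SCPS itself may proceed differently.

```python
def count_components(nodes, edges):
    """Count the connected components of an undirected graph with union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(node) for node in nodes})


def achievable_cluster_count(requested, nodes, edges):
    """Clamp the requested count between the component count and the node count."""
    components = count_components(nodes, edges)
    return min(max(requested, components), len(nodes))


# Seven sequences with a single nonzero similarity: six connected components.
nodes = list("ABCDEFG")
edges = [("A", "B")]
print(count_components(nodes, edges))
print(achievable_cluster_count(5, nodes, edges))
```

Here a request for 5 clusters cannot be satisfied because the data already falls apart into 6 components, so the sketch settles on 6.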
You can also tweak the advanced parameters of each algorithm in the
Advanced algorithm parameters window that is shown after clicking on the
Parameters... button. Refer to the Clustering algorithms section
for more details on the parameter names and values you can use there.
The computation can be started by clicking on the Start button. A progress
bar at the bottom of the window will show you the approximate progress of the
computation. Note that it is not possible to estimate the remaining time for
eigenvector calculations accurately, hence the progress bar will not move while
the eigenvectors are calculated for the spectral clustering process. It will,
however, display the exact progress when performing a connected component
analysis with automatic cluster count selection. When the calculation is
finished, the result viewer window will be shown; please refer to the Result
viewer section for more details.
The Show log menu item in the Window menu can be used to display
diagnostic messages that may help you check what is going on behind the
scenes. In general, you shouldn't need this window unless you suspect something
is wrong with SCPS and you wish to file a bug report.
Cluster count selector window
The cluster count selector window is shown when you use the spectral clustering
method and have selected Manually as the cluster count selection method in the
main window. The window appears after the eigenvector calculations are finished. You can
check the top 100 eigenvalues, the corresponding eigengaps (i.e. the differences
between successive eigenvalues) and eigenratios (i.e. the ratios of successive eigenvalues).
The eigengaps and eigenratios may help you determine the appropriate cluster count:
in case of k well-separated clusters in your original dataset, you will see a
sudden drop after the k-th eigenvalue, which is also reflected in a sudden increase
in the eigengaps and the eigenratios. When SCPS is in fully automatic mode, it
uses the eigenratios to determine the number of clusters: the number of clusters is
chosen to be the smallest k such that the ratio between the k-th and the (k+1)-th eigenvalue
is larger than a predefined threshold, which is set to 1.01 by default. This cluster
count is shown as a suggestion in the cluster count selector window, but you can
override it before clicking on OK.
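The eigenratio rule described above can be written down in a few lines of Python. This is a minimal sketch, not the SCPS implementation: the function names are made up for the illustration, and the eigenvalues are assumed to be positive and sorted in decreasing order.

```python
def eigengaps(eigenvalues):
    """Differences between successive eigenvalues."""
    return [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]


def eigenratios(eigenvalues):
    """Ratios of successive eigenvalues (infinite if the next one is zero)."""
    return [a / b if b != 0 else float("inf")
            for a, b in zip(eigenvalues, eigenvalues[1:])]


def suggest_cluster_count(eigenvalues, threshold=1.01):
    """Smallest k whose k-th to (k+1)-th eigenvalue ratio exceeds the threshold."""
    for k, ratio in enumerate(eigenratios(eigenvalues), start=1):
        if ratio > threshold:
            return k
    return None  # no ratio exceeds the threshold


# Three eigenvalues close to 1, followed by a sudden drop.
eigenvalues = [1.0, 0.999, 0.998, 0.6, 0.5]
gaps = eigengaps(eigenvalues)
suggested = suggest_cluster_count(eigenvalues)
print(gaps)
print(suggested)
```

In this example the first two ratios stay below 1.01 and the third one is about 1.66, so the sketch suggests three clusters.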
Advanced algorithm parameters
Each algorithm in SCPS has a set of parameters that can be used to tweak the behaviour
of the algorithm. For instance, the spectral clustering algorithm uses a threshold
on the eigenratios to determine the optimal number of clusters. This threshold is set
to 1.01 by default, but you can override it if you want. The Advanced algorithm
parameters dialog box can be used to enter values for the parameters you wish
to override. To add a new parameter, simply start typing its name into the Name
column in the window and then add its corresponding value in the Value column.
A new row will be added to the parameter table if all the rows are full. If you
wish to delete a parameter that you have entered previously, simply erase the name
and the value from the corresponding row.
For more details on the parameter names and values you can use for each algorithm,
please refer to the Clustering algorithms section.
Result viewer
The result viewer window is shown at the end of the calculation. It allows you
to examine the clusters and calculate various internal quality measures. The window
consists of a listbox on the left which allows you to select the cluster or measure
you wish to examine and a large text box on the right which shows the selected cluster
or the value of the selected measure. To save the contents of the text box to a file,
click on the Save button on the toolbar. Alternatively, you can use the Save all
button to dump the selected clustering as a whole into a TXT or XGMML file. See the
section on the Supported output formats for more details.
The following quality measures are calculated for the resulting partition:
- Mass fraction
- A simple internal quality measure that quantifies the fraction of the total similarity
values concentrated within the clusters. It is formally defined as the sum of
similarity values for each pair of elements that are in the same cluster,
divided by the total similarity over the whole network. A disadvantage of this
measure is that it attains its maximum when all elements are within the same
cluster, thus maximising the mass fraction alone is not meaningful. However, one
can safely infer that the partition is bad if the mass fraction is very small.
- Modularity
- A more sophisticated internal quality measure that quantifies how much larger
the total intra-cluster similarity is than the one we would observe on
completely random datasets having the same similarity distribution as the
dataset being analysed. The advantage of the modularity measure is that it is
zero for trivial partitions (i.e. when all vertices are within the same cluster
or when all vertices are in different clusters). The modularity measure is
explicitly maximised during connected component analysis and hierarchical clustering
when automatic cluster count selection is used. For more details and an exact
definition of the modularity measure, see .
- Heatmap of the rearranged similarity matrix
- This quality measure is not an exact numeric value, but it provides a
visual cue to the performance of the clustering algorithm. The basic idea
is that the initial similarity matrix can be plotted as a heatmap where
each pixel corresponds to a single cell of the matrix and the intensity of
the pixel is proportional to the weight that the corresponding cell in the
matrix represents. The rows and columns of the similarity matrix can be
arranged in arbitrary order, but if one arranges them in a way that rows
and columns corresponding to the same cluster are next to each other, the
resulting heatmap will show a block-diagonal structure when the clustering
is good. On the heatmaps produced by SCPS, white pixels correspond to low
similarity and black pixels correspond to high similarity. The heatmaps
can also be exported in PNG, JPG, BMP or TIF format by clicking on the Save
button.
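The rearrangement behind the heatmap can be sketched in Python as follows. The function names are hypothetical, and the linear mapping from similarity to grey level is an assumption that merely follows the description above (white for low similarity, black for high similarity); SCPS may scale the intensities differently.

```python
def rearrange_by_cluster(matrix, labels):
    """Reorder rows and columns so that members of a cluster are adjacent."""
    # sorted() is stable, so the original order is kept within each cluster.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    rearranged = [[matrix[i][j] for j in order] for i in order]
    return rearranged, order


def to_grey(value, max_value):
    """Map a similarity to a grey level: 255 is white (low), 0 is black (high)."""
    if max_value == 0:
        return 255
    return round(255 * (1 - value / max_value))


# Elements 0 and 2 form one cluster, elements 1 and 3 form the other.
matrix = [
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]
rearranged, order = rearrange_by_cluster(matrix, labels)
for row in rearranged:
    print([to_grey(value, 1.0) for value in row])
```

After the rearrangement the high similarities sit in two 2x2 blocks along the diagonal, which is the block-diagonal pattern you would expect from a good clustering.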
Note
The modularity and the mass fraction are calculated only when the algorithm
is working on similarity values, as they do not make sense when the input
data contains distances.
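The definition of the mass fraction given above translates into a short Python sketch. The function name and the input layout (a symmetrised dictionary that stores every pair once, plus a dictionary that maps each element to its cluster) are assumptions made for the illustration.

```python
def mass_fraction(similarities, membership):
    """Fraction of the total similarity that falls within the clusters."""
    total = sum(similarities.values())
    if total == 0:
        return 0.0
    within = sum(weight for (a, b), weight in similarities.items()
                 if membership[a] == membership[b])
    return within / total


# Symmetrised similarities, every unordered pair stored once.
similarities = {("A", "B"): 0.8, ("C", "D"): 0.6, ("B", "C"): 0.1}
two_clusters = {"A": 0, "B": 0, "C": 1, "D": 1}
one_cluster = {"A": 0, "B": 0, "C": 0, "D": 0}
print(mass_fraction(similarities, two_clusters))
print(mass_fraction(similarities, one_cluster))
```

The second call shows the disadvantage mentioned above: putting every element into a single cluster always yields the maximum value of 1.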
The cluster viewer
Whenever you click on a cluster in the result window, the members of the cluster
will be listed on the right side. If you used a FASTA input file with standard
NCBI deflines, the GenBank accession numbers will be shown; otherwise, the ID
from the input file will be kept as is. You can switch to a graph visualisation
view by clicking on the second button under the list (the one showing a schematic
graph with three vertices), and of course you can switch back to the list view
by clicking on the first button (the one showing a schematic list with four items).
Clicking on the Save button on the toolbar while showing the list will save
the members of the cluster to a file, while clicking on it while showing the graph
will export the graph in PNG, JPG, BMP or TIF format.
You may also attach short textual descriptions to the cluster member IDs to
make the results more comprehensible. If you used a FASTA input file and it had
NCBI deflines, the descriptions from the FASTA file should have appeared
automatically. If your FASTA file did not contain NCBI deflines, the first
word of the defline is considered as the ID and the rest is considered as the
description. If your input file was not in FASTA format, you can still attach
descriptions by clicking on the Descriptions button below the list and
loading a FASTA or text file later. Textual description files must contain
two columns separated by tabs, the first column being the ID and the second
being the description. You can create such a file easily from Excel by choosing
to save the file in tab delimited format.
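If you prefer to generate or check a description file with a script, the two rules above (first word of a non-NCBI defline is the ID, and description files are tab-separated ID and description pairs) can be sketched in Python. The function names are hypothetical and the sketch does not reproduce the parser used by SCPS.

```python
import io


def split_defline(defline):
    """Split a non-NCBI FASTA defline into (ID, description)."""
    parts = defline.lstrip(">").strip().split(None, 1)
    if not parts:
        return "", ""
    description = parts[1] if len(parts) > 1 else ""
    return parts[0], description


def read_descriptions(handle):
    """Read tab-separated 'ID<TAB>description' lines into a dict."""
    descriptions = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        seq_id, _, description = line.partition("\t")
        descriptions[seq_id.strip()] = description.strip()
    return descriptions


# An in-memory file stands in for a description file saved from a spreadsheet.
sample = io.StringIO("seq1\tHypothetical protein\nseq2\tKinase\n")
descriptions = read_descriptions(sample)
parsed = split_defline(">seq1 hypothetical protein [Homo sapiens]")
print(descriptions)
print(parsed)
```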
Assuming that the ID column in the cluster viewer contains GenBank IDs, you can
also fetch the descriptions directly from NCBI by clicking on the Retrieve from NCBI
menu item in the popup menu of the Descriptions button. The descriptions
will be downloaded using NCBI's EUtils service.