Spectral clustering of protein sequences

Author: Tamás Nepusz, Rajkumar Sasidharan, Alberto Paccanaro
Contact: tamas@cs.rhul.ac.uk

If you use results calculated by SCPS in a publication, please cite one of the suggested references (see below).

Introduction

This document introduces SCPS, a tool to identify homologous proteins from protein sequence data. The document is divided into the following sections:

The one-minute guide to using SCPS

  1. Prepare your data file. SCPS accepts pairwise similarity data, pairwise BLAST E-values or FASTA files. Refer to the Supported input formats section for more details.
  2. Select your data file in the main window. SCPS will try to autodetect the file format based on the file extension and select the appropriate input transformation. If you use an extension that is not listed above, make sure you select the appropriate transformation in the Transformation combo box.
  3. Select the clustering algorithm you want to use. Spectral clustering is preferred in most of the cases.
  4. Select the number of clusters you want. You can let SCPS determine the number of clusters automatically, or you can specify an exact or a maximum cluster count. You can also opt for selecting the number of clusters manually based on the eigenvalues of the similarity matrix if you are using the spectral clustering algorithm.
  5. Press Start and wait for the results. The calculation should not take more than a couple of seconds for approximately a thousand protein sequences if you are using a decent desktop machine.
  6. Examine the results. The result of each algorithm in SCPS is a set of key-value pairs. Each result set will contain a key called membership, the value corresponding to this key is the membership vector of the resulting partition (i.e., it gives you a cluster index for each element in your data set). You can also list the members of each cluster individually by selecting the appropriate cluster in the box on the left hand side of the result window. Small badges show you the size of each cluster.
  7. Export the results. Click on the Save button in the result window to save the result shown in the large textbox to a text file. Alternatively, click on the Save All button to save the complete result set to a file. Complete result sets can be saved either into a plain text file that contains the membership vector and the quality measures, or to an XGMML file that can be imported into Cytoscape later to visualise the results.

If you want to learn more about the SCPS graphical user interface, read on.

The SCPS graphical user interface

Main window

This window is the one you see when you start the graphical interface. The window is divided into three parts. The top frame allows you to select the input file in the Input filename textbox and specify the transformation you want to perform on the values in the input file before starting the clustering in the Transformation combo box. Note that the transformation is selected automatically if the extension of your input file is one of the known ones (see Supported input formats): E-values obtained from FASTA files and files ending in .blast will be transformed using a non-linear similarity transformation, while the similarity values in .sim will be left intact.

Before the analysis, a symmetrisation process will be performed. The symmetrisation process takes all A-B pairs of elements, checks the similarity function for A-B and for B-A, and keeps either the higher or the lower value. The symmetrisation method can be set using the Symmetrisation combo box. It is safe to use the max(A, B) similarity method in most of the cases.

Warning

There is a common catch associated with the min(A, B) symmetrisation method. If you don't specify a similarity value for a pair of elements in the input file, it is assumed to be zero. Hence, if there exists a pair of elements A and B where a similarity value is specified for A-B but not for B-A, the value will be ignored as the missing one is treated as zero, which is smaller than any positive weight.

The middle frame in the main window is used to select the algorithm you wish to run and also tune its behaviour. The algorithm can be selected using the Clustering algorithm combo box. The Number of clusters combo box allows you to tune how the algorithm decides the number of clusters. The following methods are available (but not all clustering algorithms support all the methods, so some may be hidden depending on your algorithm selection):

Automatic
The algorithm will try to select the number of clusters automatically. The exact details of the process differ for each algorithm. Please refer to the Clustering algorithms section for more details.
Exactly
The algorithm will try to create exactly k clusters, where k is a number given by you in a textbox that appears when you select this option in the Number of clusters combobox. Note that sometimes the requested cluster count cannot be satisfied; e.g., you cannot create 5 clusters when your original similarity dataset consists of more than 5 connected components. In this case, the algorithm will try to get as close to the desired number of clusters as possible.
At most
This method is similar to Automatic, but it imposes an upper bound on the number of clusters. It is highly advised to use this option when you run the spectral clustering algorithm on large datasets (larger than a thousand sequences or so), as it saves time and resources: the algorithm will calculate only the top k eigenvalues and eigenvectors, which can be done more efficiently for sparse input matrices.
Manually
This method is supported only for the spectral clustering algorithm: it will calculate the top 100 eigenvalues and eigenvectors and lets you select the number of clusters based on the eigenvalues and eigengaps. Due to the fact that only the top 100 eigenvalues are computed, this method is suitable for datasets where you don't expect more than a hundred clusters. See the section on the Cluster count selector window for more details.

You can also tweak the advanced parameters of each algorithm in the Advanced algorithm parameters window that is shown after clicking on the Parameters... button. Refer to the Clustering algorithms section for more details on the parameter names and values you can use there.

The computation can be started by clicking on the Start button. A progress bar at the bottom of the window will show you the approximate progress of the computation. Note that it is not possible to estimate the remaining time for eigenvector calculations accurately, hence the progress bar will not move while the eigenvectors are calculated for the spectral clustering process. It will, however, display the exact progress when performing a connected component analysis with automatic cluster count selection. When the calculation is finished, the result viewer window will be shown; please refer to the Result viewer section for more details.

The Show log menu item in the Window menu can be used to display diagnostics messages that may help you check what is going on behind the scenes. In general, you shouldn't need this window unless you suspect something is wrong with SCPS and you wish to file a bug report.

Cluster count selector window

The cluster count selector window is shown when you selected the Manual cluster count selection method in the main window and you are using the spectral clustering method. The window appears after the eigenvector calculations are finished. You can check the top 100 eigenvalues, the corresponding eigengaps (i.e. the differences between successive eigenvalues) and eigenratios (i.e. the ratios of successive eigenvalues). The eigengaps and eigenratios may help you determine the appropriate cluster count: in case of k well-separated clusters in your original dataset, you will see a sudden drop after the k-th eigenvalue, which is also reflected in a sudden increase in the eigengaps and the eigenratios. When SCPS is in fully automatic mode, it uses the eigenratios to determine the number of clusters: the number of clusters is chosen to be the smallest k such that the ratio between the kth and the (k+1)th eigenvalue is larger than a predefined threshold, which is set to 1.01 by default. This cluster count is shown as a suggestion in the cluster count selector window, but you can override it before clicking on OK.

Advanced algorithm parameters

Each algorithm in SCPS has a set of parameters that can be used to tweak the behaviour of the algorithm. For instance, the spectral clustering algorithm uses a threshold on the eigenratios to determine the optimal number of clusters. This threshold is set to 1.01 by default, but you can override it if you want. The Advanced algorithm parameters dialog box can be used to enter values for the parameters you wish to override. To add a new parameter, simply start typing its name into the Name column in the window and then add its corresponding value in the Value column. A new row will be added to the parameter table if all the rows are full. If you wish to delete a parameter that you have entered previously, simply erase the name and the value from the corresponding row.

For more details on the parameter names and values you can use for each algorithm, please refer to the Clustering algorithms section.

Result viewer

The result viewer window is shown at the end of the calculation. It allows you to examine the clusters and calculate various internal quality measures. The window consists of a listbox on the left which allows you to select the cluster or measure you wish to examine and a large text box on the right which shows the selected cluster or the value of the selected measure. To save the contents of the text box to a file, click on the Save button on the toolbar. Alternatively, you can use the Save all button to dump the selected clustering as a whole into a TXT or XGMML file. See the section on the Supported output formats for more details.

The following quality measures are calculated for the resulting partition:

Mass fraction
A simple internal quality measure that quantifies the fraction of the total similarity values concentrated within the clusters. It is formally defined as the sum of similarity values for each pair of elements that are in the same cluster, divided by the total similarity over the whole network. A disadvantage of this measure is that it attains its maximum when all elements are within the same cluster, thus maximising the mass fraction alone is not meaningful. However, one can safely infer that the partition is bad if the mass fraction is very small.
Modularity
A more sophisticated internal quality measure that quantifies how much larger is the total intra-cluster similarity from the one that we would observe on completely random datasets having the same similarity distribution as the dataset being analysed. The advantage of the modularity measure is that it is zero for trivial partitions (i.e. when all vertices are within the same cluster of when all vertices are in different clusters). The modularity measure is explicitly maximised during connected component analysis and hierarchical clustering when automatic cluster count selection is used. For more details and an exact definition of the modularity measure, see [3].
Heatmap of the rearranged similarity matrix
This quality measure is not an exact numeric value, but it provides a visual cue to the performance of the clustering algorithm. The basic idea is that the initial similarity matrix can be plotted as a heatmap where each pixel corresponds to a single cell of the matrix and the intensity of the pixel is proportional to the weight that the corresponding cell in the matrix represents. The rows and columns of the similarity matrix can be arranged in arbitrary order, but if one arranges them in a way that rows and columns corresponding to the same cluster are next to each other, the resulting heatmap will show a block-diagonal structure when the clustering is good. On the heatmaps produced by SCPS, white pixels correspond to low similarity and black pixels correspond to high similarity. The heatmaps can also be exported in PNG, JPG, BMP or TIF format by clicking on the Save button.

Note

The modularity and the mass fraction is calculated only when the algorithm is working on similarity values as they do not make sense when the input data contains distances.

The cluster viewer

Whenever you click on a cluster in the result window, the members of the cluster will be listed on the right side. If you used a FASTA input file with standard NCBI deflines, the GenBank accession numbers will be shown; otherwise, the ID from the input file will be kept as is. You can switch to a graph visualisation view by clicking on the second button under the list (the one showing a schematic graph with three vertices), and of course you can switch back to the list view by clicking on the first button (the one showing a schematic list with four items). Clicking on the Save button on the toolbar while showing the list will save the members of the cluster to a file, while clicking on it while showing the graph will export the graph in PNG, JPG, BMP or TIF format.

You may also attach short textual descriptions to the cluster member IDs to make the results more comprehensible. If you used a FASTA input file and it had NCBI deflines, the descriptions from the FASTA file should have appeared automatically. If your FASTA file did not contain NCBI deflines, the first word of the defline is considered as the ID and the rest is considered as the description. If your input file was not in FASTA format, you can still attach descriptions by clicking on the Descriptions button below the list and loading a FASTA or text file later. Textual description files must contain two columns separated by tabs, the first column being the ID and the second being the description. You can create such a file easily from Excel by choosing to save the file in tab delimited format.

Assuming that the ID column in the cluster viewer contains GenBank IDs, you can also fetch the descriptions directly from NCBI by clicking in the Retrieve from NCBI menu item in the popup menu of the Descriptions button. The descriptions will be downloaded using NCBI's EUtils service.

Preferences window

Currently the preference window can be used only to set the path to external tools that SCPS can use. In particular, you can tweak the following settings:

BLAST path
Set this to the directory containing the BLAST executables (i.e. formatdb and blastall on Linux and Mac OS X, formatdb.exe and blastall.exe on Windows) if you want to use FASTA files directly in SCPS. SCPS will complain when you try to use a FASTA file directly and you did not set the proper path here.
MCL path
Set this to the full path of the MCL executable (i.e. mcl on Linux and Mac OS X, mcl.exe on Windows) if you want to use the TribeMCL algorithm on your dataset. SCPS will call the MCL executable in the background and parse its output to obtain the clustering results.

Clustering algorithms

Spectral clustering

Input

This method requires similarity values as its input. If you use a file containing BLAST E-values or run SCPS on a FASTA file directly, convert the E-values to similarities by using the appropriate transformation in the Input transformation combo box.

Description

Spectral clustering is the algorithm defined in [1], which is a slighly modified version of the spectral clustering published in [4]. The algorithm uses the eigenvalues and eigenvectors of the normalised similarity matrix to derive the clusters. For more details, see [1].

When you specify an exact cluster count k, the algorithm will have to compute at most the top k eigenvalues and eigenvectors of the normalised similarity matrix. The same is true when you specify that you need at most k clusters. However, if you don't specify a maximum cluster count, the algorithm will have to compute all the eigenvalues and eigenvectors of the matrix to choose the number of clusters, therefore it is highly suggested that you specify a maximum cluster count unless you have less than a thousand sequences in your dataset or you have plenty of time.

When the algorithm doesn't get the exact cluster count as the input, it decides the number of clusters based on a simple procedure that evaluates the top k eigenvalues of the normalised similarity matrix. First, the k-1 eigenratios of the top k eigenvalues are calculated by ordering the eigenvalues in decreasing order and taking the ratio of successive eigenvalues. The number of clusters will be selected by looking for the first eigenratio which is larger than a prescribed threshold epsilon. epsilon is set to 1.02 by default, which is suitable for most of the datasets, but you may consider increasing it to around 1.03 for smaller datasets.

Advanced parameters

The only advanced parameter that you can tweak is the above mentioned epsilon value. epsilon can be modified by opening the Advanced algorithm parameters dialog, entering epsilon to the Name column and entering the corresponding value to the Value column. Make sure you enter epsilon with lowercase letters and no added whitespace before or after it.

Markov clustering (TribeMCL)

Input

This algorithm works directly on E-values which are transformed to similarities by taking the negative base 10 logarithm. Don't use this method directly on similarity values.

Description

This is the original TribeMCL algorithm as published in [5]. It is not implemented directly in SCPS, the application simply calls an external Markov clustering implementation for the time being. You must specify the full path to the MCL executable in the Preferences window if you want to use this method. MCL supports automatic cluster count detection only, but the resolution of the algorithm can be controlled by the inflation parameter. Use the Advanced algorithm parameters dialog to tweak the value of the parameter.

You can obtain the source code of MCL from http://www.micans.org/mcl/. If you are using Linux or OS X, you should be able to compile the MCL executable from source without problems. Some Linux distributions (e.g., Debian and Ubuntu) provide MCL in their package manager, so be sure to look there first before you try rolling your own MCL.

Advanced parameters

The only advanced parameter that you can tweak is the above mentioned inflation value. The typical range for the inflation value is between 1.2 and 5.0, and smaller inflation values correspond to larger, coarse-grained clusters. The inflation value can be modified by opening the Advanced algorithm parameters dialog, entering inflation to the Name column and entering the corresponding value to the Value column. Make sure you enter inflation with lowercase letters and no added whitespace before or after it.

Connected component analysis

Input

This method requires similarity values as its input. If you use a file containing BLAST E-values or run SCPS on a FASTA file directly, convert the E-values to similarities by using the appropriate transformation in the Input transformation combo box.

Description

Connected component analysis (CCA) is a simple graph-theoretic approach to find homologs based solely on sequence similarity. CCA works on the similarity graph derived from the input similarity matrix as follows: each vertex of the graph will correspond to a sequence and each edge corresponds to a similarity relationship. An edge is present between two vertices if the corresponding sequences have a similarity larger than a given threshold epsilon. CCA finds the connected components of this graph (i.e., the sets of vertices that are mutually reachable from each other via the edges) and then considers sequences corresponding to the vertices within the same connected components as homologs.

When you specify an exact cluster count k, CCA will start from the edge conveying the largest similarity value and merges the vertices one by one into connected components (always merging the two vertices connected by the highest similarity value in that step) until there are no more edges to merge or there are only k connected components left. Otherwise, CCA will start the same process and merges vertices until only a single component is left and calculates the modularity [3] of each resulting partition. Finally, the partition with the highest modularity is chosen. Note that none of these methods requires an exact epsilon, it will be determined automatically. However, you can also specify an exact epsilon, which will run the simplified process mentioned in the first paragraph.

Advanced parameters

The only advanced parameter that you can tweak is the above mentioned epsilon value. epsilon can be modified by opening the Advanced algorithm parameters dialog, entering epsilon to the Name column and entering the corresponding value to the Value column. Make sure you enter epsilon with lowercase letters and no added whitespace before or after it. If you specify epsilon, it doesn't matter anymore which cluster count selection method you choose, as your choice of epsilon uniquely determines the resulting partition.

Hierarchical clustering

Input

This algorithm works directly on E-values. Don't use this method directly on similarity values.

Description

Hierarchical clustering is a family of clustering methods that start with individual data points (i.e. the sequences) and then build a tree by iteratively merging the closest points until only one is left. The final cluster assignment is then determined by cutting the branches of the tree at a specific level. The various hierarchical clustering methods usually differ only in the way they define the distance between two sets of data points and the way they choose the optimal height to cut the branches of the tree in the end.

SCPS uses the average distance metric for merging clusters. This metric defines the distance between two sets of data points as the average distance between all possible point pairs such that one point is chosen from one of the sets and the other one is from the other set. The default cutoff point in the tree is at the height corresponding to 0.000001, which is suitable for E-values.

Advanced parameters

There are two parameters that you can modify:

epsilon
This represents the height where the branches of the dendrogram are cut in the end to obtain the final cluster assignment.
max_distance
The theoretically possible maximum distance between data points. This distance will be used for data points for which there is no corresponding distance data in the input file. Since we expect the input file to contain E-values and the default BLAST E-value threshold is 10 (i.e. a match is not reported in the BLAST output if the E-value would be larger than 10), the default value of this parameter is also 10. Should you want to conduct hierarchical clustering in other datasets that do not contain E-values, adjust this variable accordingly. If you work with E-values, there should be no need to modify this parameter.

Supported file formats

Supported input formats

SCPS supports the following input formats:

Pairwise similarity data (.sim)
These files must contain lies of the following form: id1 id2 similarity, where id1 and id2 are arbitrary identifiers not containing whitespaces, and similarity is their corresponding similarity score between 0 and 1. It is assumed that the self-similarity of every element is 1. Data of this kind is accepted by Spectral clustering and Connected component analysis only.
Pairwise BLAST E-values (.blast)
These files are similar to the pairwise similarity data files, but they contain a BLAST E-value instead of a similarity score. Data of this kind is accepted by all the algorithms, but you have to transform them to similarities first when using Spectral clustering or Connected component analysis by selecting the appropriate entry in the Input transformation combo box.
FASTA files (.fasta)

You can also use FASTA files containing sequence information directly. These files will be passed to formatdb and blastall to obtain BLAST E-values, which will be transformed to similarities according to [1]. BLAST will perform an all-against-all query on the given FASTA file. Note that you must specify the path to the BLAST executables first in the Preferences window. Data of this kind is accepted by all the algorithms, but you have to transform them to similarities first when using Spectral clustering or Connected component analysis by selecting the appropriate entry in the Input transformation combo box.

It is advised to use standard NCBI deflines in your FASTA files as formatdb will attempt to parse them. Deflines not recognised will be considered as local sequence identifiers (lcl|identifier in the NCBI defline format). If a defline conforms to the NCBI defline format, SCPS will extract the GenBank accession number from it (the part between the first and second vertical bars) and use that later in the result window.

Supported output formats

SCPS supports the following output formats (you can use these in the Result viewer dialog after clicking on the Save all button):

Plain text output (.txt)
The plain text output will contain the calculated quality measures and the resulting partition in a simple human-readable format. The output consists of a few header lines and a membership vector. The header lines start with a percentage mark (%) and they contain the calculated quality measures. The rest of the file describes the partition itself. Each line corresponds to one element of the set being clustered. The line consists of the element name (as found in the input file) and the corresponding cluster index, separated by tabs. Cluster indices start from 1 and elements in the same cluster have the same cluster index. You can load this file into Matlab later using Matlab's textread function which ignores rows starting with a percentage mark if you set the comment style to be matlab.
XGMML output (.xgmml)
The XGMML output file is essentially a graph representation of the symmetrised input data in XML format. The graph contains an edge attribute named weight that corresponds to the similarity value conveyed by that edge, and a vertex attribute named membership that can be used to recover the partition. Several graph-level attributes are used to store the quality measures calculated by SCPS. XGMML files can easily be imported into Cytoscape for visualisation, see Visualising results in Cytoscape for more details.

Working with external tools

Feeding sequence data directly to BLAST

SCPS can operate directly on FASTA files. If you use a FASTA file as input, the sequences in the file will be piped to formatdb to build a database, and an all-versus-all query will be run on this database using blastall. The returned similarity values will then be used to construct the input dataset on-the-fly.

Visualising results in Cytoscape

SCPS can export the calculated partition into an XGMML file which can later be imported into Cytoscape for visualisation. Assuming that the partition is already calculated and the Result viewer is shown on the screen, do the following:

  1. Click on the Save all button, select Cytoscape XGMML files as the file format, enter the name of the output file and click on Save.
  2. Load the XGMML file into Cytoscape by clicking on File / Import / Network (multiple file types)... and selecting the XGMML file.
  3. Try to create a visually pleasing layout using Cytoscape's built-in layout methods. Layout / yFiles / Organic usually gives good results.
  4. Use the VizMapper tab on the dock on the left to adjust the vertex and edge colors:
    • The membership attribute of vertices can be used to colour the vertices based on their cluster membership using a discrete mapper. The rainbow color scheme generator usually works well if there are not too many clusters. Note that the clusters are sorted by decreasing size, so if you want to show the top five clusters only, you can set a color to values 1-5 manually and leave the rest as is.
    • The weight attribute of edges can be used to emphasize edges with higher similarity. The recommended approach is to use this attribute to control the opacity of edges using a linear mapper.

References

If you use results calculated by SCPS in a publication, please cite one of the following references:

[1](1, 2, 3) Nepusz T, Sasidharan R, Paccanaro A: SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinformatics 11:120, 2010.
[2]Paccanaro A, Casbon JA, Saqi MA: Spectral clustering of protein sequences. Nucleic Acids Res 34(5):1571-80, 2006.

Bibliography

[3](1, 2) Newman MEJ: Fast algorithm for detecting community structure in networks. Phys Rev E 69(6):066133, 2004.
[4]Ng AY, Jordan MI, Weiss Y: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems 14:849-56, MIT Press, 2001.
[5]Enright AJ, van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575-84, 2002.