Spectral clustering of protein sequences

Command line interface

Authors: Tamás Nepusz, Rajkumar Sasidharan, Alberto Paccanaro
Contact: tamas@cs.rhul.ac.uk

Introduction

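This document introduces clusterx, the command line interface to SCPS, a tool to identify protein families and superfamilies from protein sequence data. The document is divided into the following sections: a one-minute guide to using SCPS from the command line, Invocation, Clustering algorithms, Specifying input transformations, and References.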

The one-minute guide to using SCPS from the command line

The command line interface to SCPS is called clusterx and it is situated in the bin subdirectory of the SCPS package. This section illustrates some basic use-cases of SCPS from the command line. For more details, please refer to the Invocation section.

In the rest of this document, a dollar ($) sign at the start of a line in the examples represents the shell prompt of the operating system. There is no need to type it.

The easiest use-case is to conduct a spectral clustering on an input file containing id1-id2-similarity triplets:

$ clusterx input_file.txt

If your input file contains BLAST E-values, you have to transform them to similarities according to the method published in [1] (see Specifying input transformations for the exact formula). This is done as follows:

$ clusterx -t blast input_file.blast

To specify the number of clusters you need, use the -c option:

$ clusterx -c 6 -t blast input_file.blast

The maximum number of clusters you need can be specified as follows (supported only by spectral clustering and connected component analysis):

$ clusterx --param k_max=20 -t blast input_file.blast

This is preferred when you are clustering more than a few hundred sequences using the spectral clustering algorithm, as calculating the whole eigensystem can be time-consuming.

You can also supply the input file on the standard input. This is useful when you are piping E-values directly from BLAST on UNIX systems:

$ blastall -p blastp -d database -i input.fasta -m 8 | cut -f 1,2,11 | clusterx -t blast -
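The cut -f 1,2,11 step in the pipeline above keeps the query ID, the subject ID and the E-value from the tab-separated BLAST output. The following Python sketch does the same extraction; the function name and the sample line are made up for illustration and are not part of SCPS.

```python
def blast_tabular_to_triplets(lines):
    """Yield (query_id, subject_id, e_value) from tab-separated BLAST lines.

    Assumes the 12-column tabular layout produced by `blastall -m 8`,
    where the E-value is the 11th column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines, if any
        fields = line.split("\t")
        if len(fields) < 11:
            raise ValueError("expected at least 11 columns: %r" % line)
        yield fields[0], fields[1], float(fields[10])


# Made-up sample line, for illustration only
sample = "seqA\tseqB\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t380\n"
for query, subject, e_value in blast_tabular_to_triplets([sample]):
    print(query, subject, e_value, sep="\t")
```

The printed lines have the id1-id2-value layout that clusterx expects, so they could be saved to a file and clustered with -t blast.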

Invocation

clusterx can be invoked from the command line. The general invocation syntax is as follows:

clusterx [options] input_file

where options is a list of command-line options (see below) and input_file is the name of the input file to be processed. The order of the command line options is irrelevant.

Note

clusterx does no format autodetection based on the extension of the input file, so make sure you specify the appropriate input transformation when you process BLAST E-values directly.

The following command line options are recognised:

Basic command line options

-h, --help shows a general help message
-V, --version shows the version number of clusterx
-v, --verbose enters verbose mode (more output)
-q, --quiet enters quiet mode (less output, only errors will be displayed)
-m METHOD, --method METHOD
 sets the clustering method to be used. The supported methods are as follows:

  • spectral: spectral clustering as in [1]
  • manual: spectral clustering as in [1], with manual cluster count selection based on the eigengaps
  • cca: connected component analysis
  • hierarchical: hierarchical clustering using average linkage. You must also specify the desired number of clusters or add a parameter named epsilon using the --param switch.
  • mcl: invokes an external MCL implementation to conduct a Markov clustering on E-values (which will be transformed by MCL itself, so don't use it in conjunction with -t)

The default is spectral. Please refer to Clustering algorithms for more details.

-c C, --clusters C
 sets the desired number of clusters to C. The default value is -1, meaning autodetection (assuming that the chosen method supports it).
-f FORMAT, --output-format FORMAT
 sets the preferred output format. The supported formats are:

  • text = plain text format (default)
  • xgmml = XGMML graph for Cytoscape
-o FILE, --output FILE
 writes the result to the given file instead of the standard output.
-t TRANSFORMATION
 specifies the transformation that has to be applied to every weight in the input data file before clustering. E.g., if your input file contains BLAST E-values, you have to use -t blast to convert them to similarities. More complicated transformations can be specified using a simple mini-language; see Specifying input transformations for more details.

Advanced command line options

-p PARAMS, --param PARAMS
 sets advanced options used by some clustering algorithms. The parameters are defined using a natural syntax as follows:

name1=value1,name2=value2,...

E.g., if you want to set the value of the k_max parameter to 200, use the following command line option:

-p k_max=200

The set of advanced options is different for each algorithm, so please refer to Clustering algorithms for more details.
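To make the name1=value1,name2=value2 syntax concrete, here is a minimal Python sketch that splits such a string into a dictionary. It is only an illustration of the syntax, not the parser used by clusterx; in particular, accepting a bare name without a value is an assumption.

```python
def parse_params(spec):
    """Split 'name1=value1,name2=value2' into a dict of strings.

    Illustrative only: values are kept as strings, and a bare name
    without '=' is mapped to an empty string (an assumption).
    """
    params = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        name, sep, value = item.partition("=")
        params[name.strip()] = value.strip() if sep else ""
    return params


print(parse_params("k_max=200"))
print(parse_params("epsilon=1.03,k_max=20"))
```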

-s SEED, --seed SEED
 sets the seed of the random number generator to the given value. This is useful to obtain deterministic results even if the algorithm is randomised (e.g., the k-means step in the spectral clustering algorithm starts from a random configuration, but it is pretty stable in practical situations).
-S METHOD, --symmetrise METHOD
 uses the given symmetrisation method to make the input matrix symmetric. The supported methods are:

  • max: use the maximum of s(A,B) and s(B,A) (default)
  • min: use the minimum of s(A,B) and s(B,A)
  • none: don't symmetrise
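The max and min rules can be sketched in a few lines of Python. The function below is an illustration, not the code used by SCPS; how clusterx treats a pair that is present in one direction only is not stated here, so copying the single available value is an assumption.

```python
def symmetrise(similarities, method="max"):
    """Return a symmetric copy of a {(a, b): s(a, b)} dictionary.

    Assumption: if only one direction of a pair is present, its value
    is used for both directions.
    """
    if method == "none":
        return dict(similarities)
    if method not in ("max", "min"):
        raise ValueError("unknown symmetrisation method: %r" % method)
    pick = max if method == "max" else min
    result = {}
    for (a, b), s in similarities.items():
        reverse = similarities.get((b, a))
        value = s if reverse is None else pick(s, reverse)
        result[(a, b)] = value
        result[(b, a)] = value
    return result


scores = {("A", "B"): 0.9, ("B", "A"): 0.4}
print(symmetrise(scores, "max"))
print(symmetrise(scores, "min"))
```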
--show-only FIELD
 shows only the given field(s) from the result. The result of a clustering algorithm in clusterx is always a set of key-value pairs. There is a special key called membership which is always present in the result and it contains the calculated partition. Other keys may hold various quality measures (e.g., modularity or mass fraction). In batch processing, it is sometimes useful to restrict the result to a subset of keys that are of interest. E.g., if you are only interested in the membership vector of the obtained clustering and its mass fraction, use the following option:

--show-only "membership,mass fraction"

Note that quotation marks are required to treat membership,mass fraction as a single parameter instead of two separate ones.
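The effect of the quotation marks can be checked without running clusterx at all. Python's shlex module splits a command line the way a POSIX shell would, so the sketch below shows what the program receives with and without the quotes (input_file.txt is just a placeholder name):

```python
import shlex

# With quotes: the field list arrives as one argument.
quoted = shlex.split('clusterx --show-only "membership,mass fraction" input_file.txt')

# Without quotes: the shell splits the field list at the space.
unquoted = shlex.split('clusterx --show-only membership,mass fraction input_file.txt')

print(quoted)
# ['clusterx', '--show-only', 'membership,mass fraction', 'input_file.txt']
print(unquoted)
# ['clusterx', '--show-only', 'membership,mass', 'fraction', 'input_file.txt']
```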

Clustering algorithms

Spectral clustering

The spectral clustering method can be selected by -m spectral (this is the default). Spectral clustering primarily works with similarities, so if you use BLAST E-values in the input file, make sure you specify -t blast or an equivalent transformation to obtain similarities.

The supported parameters are as follows:

epsilon
The eigengap threshold to use when determining the number of clusters automatically. The default is 1.02, but you may consider increasing it a little bit (to around 1.03-1.05) if you need a finer grained clustering.
k_max
The maximum number of clusters to consider. Use this parameter if your input dataset is large (containing more than a thousand sequences) and you have a reasonable upper estimate on the number of clusters. SCPS will calculate only the top k_max eigenvalues and eigenvectors, which speeds up the spectral clustering process considerably.

Connected component analysis

The connected component analysis method can be selected by -m cca. CCA primarily works with similarities, so if you use BLAST E-values in the input file, make sure you specify -t blast or an equivalent transformation to obtain similarities.

The supported parameters are as follows:

epsilon
The similarity threshold to use. Edges with similarity less than this threshold will be removed before retrieving the connected components of the graph. Note that the threshold refers not to the original input value but the transformed one (if you used -t among the command line options). If you are using E-values and you specify -t blast on the command line, the epsilon parameter must refer to the transformed E-value. As a reference, we note that the most commonly used E-value threshold of 1e-6 yields a similarity threshold of 0.99999999999954.
k_max
The maximum number of clusters to consider.
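The thresholding step described for epsilon can be illustrated with a short Python sketch: edges below the threshold are ignored and the remaining edges are merged with a union-find structure. This is a sketch of the general technique, not the SCPS implementation; the k_max parameter is not modelled, and the function name is made up.

```python
def connected_components(edges, epsilon):
    """Group nodes linked by edges whose similarity is at least epsilon.

    edges is an iterable of (id1, id2, similarity) triplets whose
    similarities have already been transformed (e.g. by -t blast).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, similarity in edges:
        # find() registers both nodes, so sequences whose edges are all
        # below the threshold still appear as singleton clusters.
        root_a, root_b = find(a), find(b)
        if similarity >= epsilon and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=sorted)


edges = [("a", "b", 0.9), ("b", "c", 0.2), ("c", "d", 0.8)]
print(connected_components(edges, epsilon=0.5))
```

Lowering epsilon keeps more edges and therefore yields fewer, larger components.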

Hierarchical clustering

The hierarchical clustering method can be selected by -m hierarchical. Hierarchical clustering primarily works with distances, so use the E-values (or any other suitable distance measure) as the input.

The only supported parameter is as follows:

epsilon
The height where the dendrogram produced by the hierarchical clustering will be cut. A reasonable choice is 1e-6. Note that you must specify this parameter if you don't specify a cluster count on the command line using -c as hierarchical clustering cannot tell you the optimal number of clusters automatically.
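The interplay between epsilon and the cluster count can be sketched with a naive average-linkage procedure in Python. This is a small quadratic-time illustration of the general technique, not the SCPS implementation; the function name, the distance callback and the behaviour when both stopping criteria are given are assumptions.

```python
from itertools import combinations


def average_linkage(items, distance, epsilon=None, n_clusters=None):
    """Agglomerative clustering with average linkage.

    Merging stops when the closest pair of clusters is farther apart
    than epsilon (the cut height) or when n_clusters clusters remain.
    """
    if epsilon is None and n_clusters is None:
        raise ValueError("specify a cut height (epsilon) or a cluster count")
    clusters = [frozenset([item]) for item in items]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters
        total = sum(distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        height, c1, c2 = min(
            ((linkage(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if epsilon is not None and height > epsilon:
            break  # the next merge would happen above the cut height
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Made-up E-values used as distances, for illustration only
d = {frozenset("ab"): 1e-20, frozenset("cd"): 1e-15}
dist = lambda x, y: d.get(frozenset((x, y)), 1.0)
print(average_linkage("abcd", dist, epsilon=1e-6))
```

Raising an error when neither epsilon nor a cluster count is given mirrors the requirement stated above.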

Markov clustering (TribeMCL)

This is the original TribeMCL algorithm as published in [3]. It is not implemented directly in SCPS; for the time being, the application simply calls an external Markov clustering implementation. MCL supports automatic cluster count detection only, but the resolution of the algorithm can be controlled by the inflation parameter.

The Markov clustering method can be selected by -m mcl. The method primarily works with distances (E-values), but you can override that using the dont_transform parameter.

The supported parameters are as follows:

inflation
Controls the granularity of the clustering. The typical range of the inflation value is 1.2 to 5.0, and the default is 1.2. Larger inflation values result in finer grained clusterings.
dont_transform
When this parameter is specified (whatever the value is), SCPS will not use the default TribeMCL transformation to turn E-values into similarities. This means that the input values from SCPS will be passed through SCPS's own transformations only (whatever you specified on the command line using -t). When this parameter is not specified, the input values will be passed through SCPS's own transformations and the default TribeMCL transformation as well. This basically means that if you have E-values in your input file, do not specify anything using -t and do not add the dont_transform parameter, as the E-values are fine for TribeMCL. If you have similarities in your input file, do use the dont_transform parameter so TribeMCL's transformation step is skipped.
mcl_path
This parameter specifies the path to the MCL executable if SCPS is not able to find it on its own. SCPS generally assumes that MCL can be started from the command line by typing mcl. If MCL is not on the default path of the operating system, you may have to specify its full path using the mcl_path parameter.

Specifying input transformations

Custom transformations on the input data can be specified using the -t command line option. The option expects a single parameter, the transformation itself, which is specified using a simple mini-language. The mini-language is more or less compatible with the one used by MCL's -abc-tf option.

The basic building blocks of an SCPS transformation are denoted as follows:

Multiple operations can be chained to obtain a more complicated transformation by separating them with commas. For example, neglog, mul(0.4343), clamp(0, 200) is equivalent to the transformation that TribeMCL uses when converting E-values to similarities. This transformation takes the negative natural logarithm of the input value, multiplies it by 0.4343 to obtain the negative common (base-10) logarithm and then clamps the result between 0 and 200. An equivalent formulation would be neglog(10), clamp(0, 200). Note that you don't have to specify this transformation when using an external MCL implementation from SCPS (-m mcl), because SCPS invokes MCL in such a way that the transformation is performed within MCL itself.
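The chaining can be mimicked in Python to check that the two formulations agree. The sketch below models only the three operations named in the example (neglog, mul and clamp); it is not the SCPS parser, and mapping non-positive inputs to infinity before clamping is an assumption.

```python
import math


def neglog(base=math.e):
    """Negative logarithm in the given base (natural by default)."""
    def op(x):
        if x <= 0:
            return math.inf  # assumption: E-value 0 maps to +infinity
        return -math.log(x) / math.log(base)
    return op


def mul(factor):
    return lambda x: x * factor


def clamp(low, high):
    return lambda x: max(low, min(high, x))


def chain(*ops):
    """Apply the operations from left to right, as in 'op1, op2, op3'."""
    def apply(x):
        for op in ops:
            x = op(x)
        return x
    return apply


tribemcl = chain(neglog(), mul(0.4343), clamp(0, 200))
alternative = chain(neglog(10), clamp(0, 200))

for e_value in (1e-6, 1e-50, 10.0, 0.0):
    print(e_value, tribemcl(e_value), alternative(e_value))
```

The two chains differ only in the fourth decimal place or so, because 0.4343 is a rounded value of 1/ln(10).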

References

If you use results calculated by clusterx in a publication, please cite one of the following references:

[1] Nepusz T, Sasidharan R, Paccanaro A: SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinformatics 11:120, 2010.
[2] Paccanaro A, Casbon JA, Saqi MA: Spectral clustering of protein sequences. Nucleic Acids Res 34(5):1571-80, 2006.

Bibliography

[3] Enright AJ, van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575-84, 2002.