| Author: | Tamás Nepusz, Rajkumar Sasidharan, Alberto Paccanaro |
|---|---|
| Contact: | tamas@cs.rhul.ac.uk |
This document introduces clusterx, the command line interface to SCPS, a tool to identify protein families and superfamilies from protein sequence data. The document is divided into the following sections:
The command line interface to SCPS is called clusterx and it is situated in the bin subdirectory of the SCPS package. This section illustrates some basic use-cases of SCPS from the command line. For more details, please refer to the Invocation section.
In the rest of this document, a dollar ($) sign at the start of a line in the examples represents the shell prompt of the operating system. There is no need to type it.
The simplest use case is to run spectral clustering on an input file containing id1-id2-similarity triplets:
$ clusterx input_file.txt
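As an illustration of what such a file might look like, the sketch below writes three triplets and checks that each line has three fields. The sequence identifiers and similarity values are made up, and the tab separator is an assumption based on the BLAST pipeline shown later in this document, which feeds tab-separated columns to clusterx. The commands are shown without the shell prompt so that they can be pasted as they are.

```shell
# Write a hypothetical input file: one "id1 id2 similarity" triplet per line.
# Identifiers and weights are invented; the tab separator is an assumption.
printf 'seqA\tseqB\t0.92\n'  > input_file.txt
printf 'seqA\tseqC\t0.15\n' >> input_file.txt
printf 'seqB\tseqC\t0.20\n' >> input_file.txt

# Sanity check: every line should have exactly three tab-separated fields.
if awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' input_file.txt; then
  echo "input_file.txt: all lines have three fields"
else
  echo "input_file.txt: malformed line found" >&2
fi
```

The resulting file can then be passed to clusterx exactly as in the command above.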
If your input file contains BLAST E-values, you have to transform them to similarities according to the method published in [1] (see Specifying input transformations for the exact formula). This is done as follows:
$ clusterx -t blast input_file.blast
To specify the number of clusters you need, use the -c option:
$ clusterx -c 6 -t blast input_file.blast
The maximum number of clusters you need can be specified as follows (supported only by spectral clustering and connected component analysis):
$ clusterx --param k_max=20 -t blast input_file.blast
Setting k_max is recommended when you are clustering more than a few hundred sequences with the spectral clustering algorithm, because calculating the whole eigensystem can be time-consuming.
You can also supply the input file on the standard input. This is useful when you are piping E-values directly from BLAST on UNIX systems:
$ blastall -p blastp -d database -i input.fasta -m 8 | cut -f 1,2,11 | clusterx -t blast -
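To see what the cut step in this pipeline passes on to clusterx, the sketch below applies it to a single made-up line of BLAST tabular output (-m 8). In that format column 1 is the query identifier, column 2 the subject identifier and column 11 the E-value. The identifiers and numbers are invented for illustration.

```shell
# One hypothetical line of BLAST -m 8 output (12 tab-separated columns):
# query, subject, percent identity, alignment length, mismatches,
# gap openings, q.start, q.end, s.start, s.end, E-value, bit score
line=$(printf 'P1\tP2\t85.00\t120\t18\t0\t1\t120\t5\t124\t2e-50\t190')

# Keep only query id, subject id and E-value, as in the pipeline above.
triplet=$(printf '%s\n' "$line" | cut -f 1,2,11)
printf '%s\n' "$triplet"
```

The surviving three columns form the id1-id2-weight triplet that clusterx reads, with the E-value still untransformed, which is why -t blast is needed at the end of the pipeline.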
clusterx can be invoked from the command line. The general invocation syntax is as follows:
clusterx [options] input_file
where options is a list of command-line options (see below) and input_file is the name of the input file to be processed. The order of the command line options is irrelevant.
Note
clusterx does no format autodetection based on the extension of the input file, so make sure you specify the appropriate input transformation when you process BLAST E-values directly.
The following command line options are recognised:
| Option | Description |
|---|---|
| -h, --help | shows a general help message |
| -V, --version | shows the version number of clusterx |
| -v, --verbose | enters verbose mode (more output) |
| -q, --quiet | enters quiet mode (less output, only errors will be displayed) |
| -m METHOD, --method METHOD | sets the clustering method to be used. The supported methods are spectral, cca, hierarchical and mcl. The default is spectral. Please refer to Clustering algorithms for more details. |
| -c C, --clusters C | sets the desired number of clusters to C. The default value is -1, meaning autodetection (assuming that the chosen method supports it). |
| -f FORMAT, --output-format FORMAT | sets the preferred output format. |
| -o FILE, --output FILE | writes the result to the given file instead of the standard output. |
| -t TRANSFORMATION | specifies the transformation applied to every weight in the input data file before clustering. E.g., if your input file contains BLAST E-values, use -t blast to convert them to similarities. More complicated transformations can be specified using a simple mini-language; see Specifying input transformations for more details. |
| -p PARAMS, --param PARAMS | sets advanced options used by some clustering algorithms. The parameters are given as a comma-separated list of name=value pairs: name1=value1,name2=value2,... E.g., to set the value of the k_max parameter to 200, use -p k_max=200. The set of advanced options is different for each algorithm, so please refer to Clustering algorithms for more details. |
| -s SEED, --seed SEED | sets the seed of the random number generator to the given value. This is useful to obtain deterministic results even if the algorithm is randomised (e.g., the k-means step in the spectral clustering algorithm starts from a random configuration, although it is fairly stable in practice). |
| -S METHOD, --symmetrise METHOD | uses the given symmetrisation method to make the input matrix symmetric. |
| --show-only FIELD | shows only the given field(s) from the result. The result of a clustering algorithm in clusterx is always a set of key-value pairs. There is a special key called membership which is always present in the result and contains the calculated partition. Other keys may hold various quality measures (e.g., modularity or mass fraction). In batch processing, it is sometimes useful to restrict the result to a subset of keys that are of interest. E.g., if you are only interested in the membership vector of the obtained clustering and its mass fraction, use --show-only "membership,mass fraction". Note that the quotation marks are required to treat membership,mass fraction as a single parameter instead of two separate ones. |
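As a sketch of how these options might be combined for batch processing, the loop below clusters every BLAST result file in the current directory and keeps only the membership vector. It uses only options listed above; the .blast and .clusters file extensions, the seed value 42 and the cluster_all function name are arbitrary choices made for this illustration.

```shell
# Cluster every *.blast file in the current directory (hypothetical naming).
# -q keeps the output quiet, -s fixes the random seed for reproducibility,
# --show-only restricts the result to the membership vector, and -o writes
# it to a file named after the input.
cluster_all() {
  for f in *.blast; do
    [ -e "$f" ] || continue   # the pattern stays literal if nothing matches
    clusterx -q -t blast -s 42 --show-only membership \
      -o "${f%.blast}.clusters" "$f"
  done
}

cluster_all
```

Wrapping the loop in a function keeps the invocation in one place, so changing the seed or the set of displayed fields affects every input file consistently.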
The spectral clustering method can be selected by -m spectral (this is the default). Spectral clustering primarily works with similarities, so if you use BLAST E-values in the input file, make sure you specify -t blast or an equivalent transformation to obtain similarities.
The supported parameters are as follows:
The connected component analysis method can be selected by -m cca. CCA primarily works with similarities, so if you use BLAST E-values in the input file, make sure you specify -t blast or an equivalent transformation to obtain similarities.
The supported parameters are as follows:
The hierarchical clustering method can be selected by -m hierarchical. Hierarchical clustering primarily works with distances, so use the E-values (or any other suitable distance measure) as the input.
The only supported parameter is as follows:
This is the original TribeMCL algorithm as published in [3]. It is not implemented directly in SCPS; for the time being, the application simply calls an external Markov clustering implementation. MCL supports automatic cluster count detection only, but the resolution of the algorithm can be controlled by the inflation parameter.
The Markov clustering method can be selected by -m mcl. The method primarily works with distances (E-values), but you can override that using the dont_transform parameter.
The supported parameters are as follows:
Custom transformations on the input data can be specified using the -t command line option. The option expects a single parameter, the transformation itself, which is specified using a simple mini-language. The mini-language is more or less compatible with the one used by MCL's -abc-tf option.
The basic building blocks of an SCPS transformation are denoted as follows:
Multiple operations can be chained to obtain a more complicated transformation by separating them with commas. For example, neglog, mul(0.4343), clamp(0, 200) is equivalent to the transformation that TribeMCL uses when converting E-values to similarities. This transformation takes the negative natural logarithm of the input value, multiplies it by 0.4343 to obtain the negative common (base-10) logarithm and then clamps the result between 0 and 200. An equivalent formulation would be neglog(10), clamp(0, 200). Note that you don't have to specify this transformation when using an external MCL implementation from SCPS (-m mcl); SCPS will invoke MCL in such a way that the transformation is performed within MCL.
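The arithmetic of this chain can be checked outside clusterx. The awk sketch below is not how SCPS implements the transformation; it merely reproduces the three steps for a few made-up E-values. How clusterx treats an E-value of exactly zero is not stated here, so the sketch assumes that it maps to the upper bound of the clamp. The transform function name is hypothetical.

```shell
# Apply neglog, mul(0.4343), clamp(0, 200) to a few hypothetical E-values.
# Assumption: an E-value of 0 is mapped directly to the upper bound (200).
transform() {
  awk '{
    if ($1 <= 0) { s = 200 }                # avoid log(0); assumed behaviour
    else         { s = -log($1) * 0.4343 }  # -ln(x) * 0.4343 ~ -log10(x)
    if (s < 0)   s = 0                      # clamp from below
    if (s > 200) s = 200                    # clamp from above
    printf "%.2f\n", s
  }'
}

printf '1e-50\n1e-5\n10\n0\n' | transform
```

For the four inputs above the sketch prints 50.00, 5.00, 0.00 and 200.00: small E-values become large similarities, while E-values above 1 would give a negative logarithm and are clamped to zero.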
If you use results calculated by clusterx in a publication, please cite one of the following references:
| [1] | Nepusz T, Sasidharan R, Paccanaro A: SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinformatics 11:120, 2010. |
| [2] | Paccanaro A, Casbon JA, Saqi MA: Spectral clustering of protein sequences. Nucleic Acids Res 34(5):1571-80, 2006. |
| [3] | Enright AJ, van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575-84, 2002. |