Running GFam

Input files

The input files can be grouped into three large groups: data files, mapping files and the configuration file. Data files contain the actual input data that is specific to a given organism. Mapping files usually map between IDs of different data sources (for instance, from InterPro domain IDs to Gene Ontology terms) or IDs to human-readable descriptions. The configuration file tells GFam where to find the data files and the mapping files. When one wants to process a new organism with GFam, it is therefore usually enough to replace the paths of the data files in the configuration only, as the mapping files can be re-used for multiple analyses.

GFam requires the following data files:

Sequence file
This file must contain all the sequences that are being analysed and annotated by GFam.
Domain assignment file
This file is produced by running iprscan, the command-line variant of InterProScan and it assigns sections of each segment in the sequence file to known domains in InterPro. The sequence IDs in this file must be identical to the ones in the sequence file; if not, one can specify a regular expression in the configuration file to extract the sequence ID from the FASTA defline.

Besides the data files, the following mapping files are also needed:

InterPro – GO mapping
This file maps InterPro IDs to their corresponding GO terms, and it can be obtained from <http://www.geneontology.org/external2go/interpro2go>.
Mapping of domain IDs to human-readable names
Fairly self-explanatory; a tab-separated flat file with two columns, the first being the domain ID and the second being the corresponding human-readable name. It is advisable to construct a file which contains at least the InterPro, Pfam, SMART and Superfamily IDs as these are the most common (and many Pfam, SMART and Superfamily IDs do not have corresponding InterPro IDs yet). If you want to create such a mapping file easily, please refer to Updating the mapping of IDs to human-readable names.
Parent-child relationships of InterPro terms
This file contains the parent/child relationships between InterPro accession numbers to indicate family/subfamily relationships. This file is used to map each InterPro subfamily ID to the corresponding family ID, and it can be obtained from EBI.
The Gene Ontology
This file contains the Gene Ontology in OBO format, and it is required only for the label assignment and overrepresentation analysis steps. The latest version of the file can be obtained from the homepage of the Gene Ontology project.

GFam accepts uncompressed files or files compressed with gzip or bzip2 for both the data and the mapping files. Compressed files will be decompressed on-the-fly in memory when needed.

The configuration file

The default configuration file of GFam is called gfam.cfg, but you can specify an alternative configuration file name on the command line using the -c switch. A sample configuration file is included in the GFam distribution; however, you can always generate a new one by running the following command:

$ bin/gfam init

This will generate a file named gfam.cfg in the current directory and list the configuration keys you have to modify before starting your analyses.

The configuration file consists of sections, led by a [section] header and followed by name=value entries. Lines beginning with # or ; are ignored and used to provide comments. Lines containing whitespace characters only are also ignored. For more details about the configuration file format, please refer to the ConfigParser module in the documentation of Python.

The full list of supported configuration keys and their default values is as follows:

Section DEFAULT

file.input.iprscan
Raw output file from IPRScan that contains domain assignments for all the sequences we are interested in.
file.input.sequences
A FASTA file containing all the sequences being analysed.
file.log.iprscan_exclusions
Log file to store why given sequences have been rejected during the filtering of the input IPRScan file. Feel free to leave it empty.
file.mapping.gene_ontology

A file containing the Gene Ontology in OBO format.

Default value: data/gene_ontology.obo

file.mapping.interpro2go

File containing the mapping of GO terms to InterPro entries, as downloaded from geneontology.org.

Default value: data/interpro2go

file.mapping.interpro2name

File containing a tab-separated list of InterPro IDs and their corresponding human-readable descriptions. This can be constructed by the following command:

$ bin/download_names.py | gzip -9 >data/names.dat.gz.

Default value: data/names.dat.gz

file.mapping.interpro_parent_child

File containing the parent-child relationships of InterPro terms.

Default value: data/ParentChildTreeFile.txt

folder.work

The working folder in which to put intermediary files.

Default value: work

folder.output

The output folder in which to put the final results.

Default value: work

sequence_id_regexp
Python regular expression that matches gene IDs from the sequence file. This is necessary if the IPRScan output uses a sequence ID that is only a part of the sequence ID in the input FASTA file. If it is empty, no ID transformation will be done, the sequence IDs in the input FASTA file will be matched intact to the IPRScan output. If it is not empty, it must be a valid Python regular expression with a named group “id” that matches the gene ID that is used in the IPRScan output. If you don’t know what named groups are, check the documentation of the Python re module.
untrusted_sources

Which assignment sources NOT to trust from InterPro? (space separated)

Default value: HAMAP PatternScan FPrintScan Seg Coil

max_overlap

Maximum overlap allowed between assignments originating from the same data source.

Default value: 20

num_cpu_cores

Hint on the number of CPU cores to use during the analysis. Currently the BLAST invocation uses this hint to select the number of threads used by BLAST to speed up calculations.

The default value is 1 since it is not possible to auto-detect the number of CPU cores in a platform independent way. Feel free to raise this value if your computer has multiple CPU cores.

Default value: 1

Section analysis:blast_filter

min_seq_identity

Minimum sequence identity between two fragments to consider them as being in the same domain.

Default value: 45

min_alignment_length

Minimum alignment length between two fragments to consider them as being in the same domain. If normalization_method is not off, this must be the normalized alignment length threshold according to the chosen normalization method.

Default value: 0.7

max_e_value

Maximum E-value between two fragments in order to consider them as being in the same domain.

Default value: 1e-3

normalization_method

Normalization method to use for calculating normalized alignment length Must be one of: off, smaller, larger, query, hit.

Default value: query

Section analysis:find_domain_arch

min_novel_domain_size

A novel domain occur in at least this number of sequences.

Default value: 4

Section analysis:find_unassigned

min_seq_length

Minimum number of amino acids in a sequence in order to consider it further (i.e. to calculate its unassigned fragments)

Default value: 30

min_fragment_length

Minimum number of amino acids in a sequence fragment in order to consider that fragment as a novel domain candidate.

Default value: 75

sequences_file

Input FASTA file containing all the sequences of the representative gene model being analysed.

Default value: same as file.input.sequences

Section analysis:iprscan_filter

e_value_thresholds

E-value thresholds to use when processing the initial InterPro file. This entry is a semicolon-separated list of source=threshold pairs, the given threshold will be used for the given data source. If an entry contains only a threshold, this will be considered as a default threshold for sources not explicitly mentioned here.

Default value: 1e-3;superfamily=inf;HMMPanther=inf;Gene3D=inf;HMMPIR=inf

interpro_parent_child_mapping

File containing the parent-child relationships of InterPro terms.

Default value: same as file.mapping.interpro_parent_child

stages.1

These configuration keys specify which assignment sources are to be taken into account at each stage of the analysis. For more information about what these stages are, please refer to the documentation, especially the description of Step 2 in section “Steps of the GFam pipeline”

Default value: ALL-HMMPanther-Gene3D

Section analysis:jaccard

min_similarity

Minimum Jaccard similarity between the neighbour sets of two fragments in order to consider them as being in the same domain.

Default value: 0.66

assume_loops

Whether to assume that a protein is connected to itself or not.

Default value: 1

only_linked

Whether to consider only those protein pairs which are linked in the input file. If this is 1, protein pairs not in the input file will not be returned even if their Jaccard similarity is larger than the given threshold.

Default value: 1

Section analysis:overrep

confidence

The p-value threshold in the hypergeometric test used in the overrepresentation analysis process.

Default value: 0.05

correction

The method used to account for multiple hypothesis testing. Valid choices: bonferroni (Bonferroni correction), Sidak (Sidak correction), fdr (Benjamini-Hochberg method), none (off). The Bonferroni and Sidak correction methods control the family-wise error rate (FWER), while the Benjamini-Hochberg method control the false discovery rate.

Default value: fdr

min_term_size

The minimum number of annotated domains a GO term must have in order to be considered in the overrepresentation analysis.

Default value: 1

Section generated

file.unassigned_fragments

File in which the unassigned sequence fragments are stored.

Default value: %(folder.work)s/unassigned_fragments.ffa

file.valid_gene_ids

File containing a list of valid gene IDs (extracted from the input file)

Default value: %(folder.work)s/gene_ids.txt

file.domain_architecture_details

File containing the detailed final domain architecture for each sequence.

Default value: %(folder.output)s/domain_architecture_details.txt

file.domain_architecture_stats

File containing genome-level domain architecture statistics.

Default value: %(folder.output)s/domain_architecture_stats.txt

Section utilities

folder.blast
The folder containing the BLAST executables. It does not matter whether you have the old C-based or the newer C++-based tools, GFam can use both if you also have the legacy_blast.pl script that adapts the new tools to the command line syntax used by the older ones.
util.formatdb

The path to formatdb. You may use the name of the folder containing the BLAST executables or the full path (including the name of the tool). If you have the newer, C++-based BLAST tools (which do not have formatdb), pass the name of the folder containing the BLAST executables here, and if you have the legacy_blast.pl script in the same folder (plus a working Perl setup), GFam will detect the situation and run legacy_blast.pl accordingly.

Default value: same as folder.blast

util.blastall

The path to blastall. You may use the name of the folder containing the BLAST executables or the full path (including the name of the tool). If you have the newer, C++-based BLAST tools (which do not have blastall), pass the name of the folder containing the BLAST executables here, and if you have the legacy_blast.pl script in the same folder (plus a working Perl setup), GFam will detect the situation and run legacy_blast.pl accordingly.

Default value: same as folder.blast

Output files

GFam produces four output files in the output folder specified in the configuration file. These files are as follows:

domain_architectures.tab

A simple tab-separated flat file that contains the inferred domain architecture for each sequence in a simple, summarised format. The file is sorted in a way such that more frequent domain architectures are placed at the top. Sequences having the same domain architecture are sorted according to their IDs.

The file has six columns. The first column is the ID of the sequence (e.g., AT1G09650.1), the second is the sequence length (e.g, 382). The third column contains a summary of the domain architecture of the sequence, where domains are ordered according to the starting position, and consecutive domain IDs are separated by semicolons (e.g., IPR022364;IPR017451). The InterPro domain ID is used whenever possible. Novel domains identified by GFam are denoted by NOVELxxxxx, where xxxxx is a five-digit identifier. The fourth column contains the frequency of this domain architecture (i.e. the number of sequences that have the same domain architecture). The fifth column is the same as the third, but the exact starting and ending positions of the domain are also added in parentheses after the domain ID (e.g., IPR022364(9-57);IPR017451(112-357)). The sixth column contains the concatenated human-readable descriptions of the domains (for instance, F-box domain, Skp2-like;F-box associated interaction domain).

domain_architecture_details.txt

This file is the human-readable variant of domain_architectures.tab (which is more suitable for machine parsing). It contains blocks separated by two newline characters; each block corresponds to a sequence and has the following format:

AT1G09650.1
    Primary assignment source: HMMTigr
    Coverage: 0.772
    Coverage w/o novel domains: 0.772
       9-  57: SSF81383 (superfamily, stage: 2) (InterPro ID: IPR022364)
               F-box domain, Skp2-like
     112- 357: TIGR01640 (HMMTigr, stage: 1) (InterPro ID: IPR017451)
               F-box associated interaction domain

The first line of each block is unindentend and contains the sequence ID. The remaining lines are indented by at least four spaces. The second line contains the name of the InterPro data source that was used to come up with the primary assignment in step 2 of the pipeline (see more details later in Steps of the GFam pipeline). The third and the fourth lines contain the fraction of positions in the sequence that are covered by at least one domain; the third line takes into account novel domains (NOVELxxxxx), while the fourth line does not. The remaining lines list the domains themselves along with the data source they came from and the stage in which they were selected. For more details about the stages, see Steps of the GFam pipeline.

assigned_labels.txt

TODO

overrepresentation_analysis.txt

This file contains the results of the Gene Ontology overrepresentation analysis for the domain architecture of each sequence. Note that since the results of the overrepresentation analysis depend only on the domain architecture, the results of sequences having the same domain architecture will be completely identical.

The file consists of blocks separated by two newlines, and each block corresponds to one sequence. Each block has the following format:

AT1G61040.1
  0.0009: GO:0016570 (histone modification)
  0.0009: GO:0016569 (covalent chromatin modification)
  0.0024: GO:0016568 (chromatin modification)
  0.0036: GO:0006325 (chromatin organization)
  0.0049: GO:0051276 (chromosome organization)
  0.0055: GO:0006352 (transcription initiation)
  0.0095: GO:0006461 (protein complex assembly)
  0.0109: GO:0065003 (macromolecular complex assembly)
  0.0111: GO:0006996 (organelle organization)
  0.0126: GO:0043933 (macromolecular complex subunit organization)

In each block, the first number is the p-value obtained from the overrepresentation analysis, the second column is the GO ID. The name corresponding to the GO label is contained in parentheses. Blocks containing a sequence ID only represent sequences with no significant overrepresented GO labels in their domain architecture.

Command line options

GFam is started by the master script in bin/ as follows:

$ bin/gfam

The exact command line syntax is bin/gfam [options] [command], where command is one of the following:

init

Generates a configuration file for GFam from scratch. The name of the configuration file will be gfam.cfg by default, but you can change it with the -c switch. GFam will refuse to overwrite existing configuration files. Example:

$ bin/gfam -c a_lyrata.cfg init
run
Runs the whole GFam pipeline. This is the default command.
clean

Removes the temporary directory used to store the intermediate results. The name of the temporary directory is determined by the folder.work configuration option in the configuration file.

Warning

If the output directory is the same as the temporary directory (folder.work is equal to folder.output in the configuration), the clean command will also delete the final results from the output folder!

The default configuration file used is always gfam.cfg, but it can be overridden with the -c switch. For example, the following command will clean the work directory specified in a_lyrata.cfg:

$ bin/gfam -c a_lyrata.cfg clean

The following extra command line switches are also available:

-h, --help shows a help message and then exits
-c FILE, --config-file=FILE
 specifies the name of the configuration FILE
-v, --verbose enables verbose logging
-d, --debug shows debug messages as well
-f, --force forces the recalculation of the results of intermediary steps in the GFam pipeline even when GFam thinks everything is up-to-date.

Besides the master script, there are scripts for re-running individual steps of the GFam pipeline. These scripts are separate Python modules in gfam/scripts and they correspond to the steps of the GFam pipeline. It is unlikely that you will have to run them by hand, but if you do, you have to supply the necessary input on the standard input stream of the scripts. For instance, if you want to do some custom filtering on a BLAST tabular result file, you can use gfam/scripts/blast_filter.py as follows:

$ python -m gfam.scripts.blast_filter -e 1e-5 <input.blast

This will filter input.blast and remove all entries with an E-value larger than 10-5. The result will be written to the standard output.

You can get a summary of the usage of each script in gfam/scripts as follows:

$ python -m gfam.scripts.blast_filter --help

Of course replace blast_filter with the name of the script you are interested in. The default values of the command line switches of these scripts come from the configuration file, and they also support -c to change the name of the configuration file.

In 99.9999% of the cases, you will only have to do bin/gfam init to create a new configuration file, bin/gfam to run the pipeline and bin/gfam clean to clean up the results.