Help for the Fitness Browser

The Fitness Browser was developed by the Arkin lab. It displays thousands of genome-wide fitness assays from the Deutschbauer lab, the Arkin lab, and collaborators.

How it works

The fitness data is collected using randomly barcoded transposons (RB-TnSeq). Each fitness experiment is based on a pool of 30,000 to 500,000 mutant strains. Every mutant strain has a transposon inserted at a random location in the genome, and each transposon includes a random barcode that allows us to track the abundance of that strain by using PCR followed by DNA sequencing ("BarSeq"). To link the barcode to the location in the genome, we use a more complicated TnSeq-like protocol.

For each fitness experiment, we compare the abundance of each strain at the end of the experiment to its abundance at the beginning. The beginning sample is also referred to as the "Time0" sample. Typically, we recover the pool of mutants from the freezer in rich media, wash the cells and take Time0 sample(s), and transfer the washed cells into many different tubes or wells. Thus, many different conditions may be compared to the same Time0 sample(s).

For details, see our methods paper (Wetmore et al, mBio 2015).

Gene fitness

Fitness values are log2 ratios that describe the change in abundance of mutants in that gene during the experiment. For most of the fitness experiments, which are growth experiments, the change reflects how well the mutants grow. Fitness = 0 means that mutants in this gene grew as well as other mutants and probably about as well as wild type strains. Fitness < 0 means that the gene was important for fitness and the mutants were less abundant at the end of the experiment than at the beginning. For example, fitness = -1 means that mutants in the gene were half as abundant at the end of the experiment, compared to the beginning. Fitness > 0 means that the gene was detrimental to fitness and that mutants had a growth advantage.

In general, if -1 < fitness < 1, then the gene has a subtle phenotype that might be statistically significant (see t scores) but will probably be difficult to interpret. Fitness < -2 or fitness > 2 are strong fitness effects. In the typical experiment, the pool of mutants doubles 4-8 times, so in principle, a conditionally essential gene should have fitness of -4 to -8. However, a pooled assay cannot distinguish between little growth and no growth. (Also, very low fitness values are noisier because they are based on a log2 ratio with a small numerator: in the typical experiment, a fitness value of -1 is reliably different from 0, but -5 is not reliably different from -4.)
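To see why very negative fitness values are noisy, it can help to convert a fitness value back into an expected read count. The sketch below does only that. The Time0 count of 100 reads is a hypothetical figure chosen for illustration, and `expected_end_reads` is an illustrative helper that ignores normalization and sequencing depth.

```python
def expected_end_reads(time0_reads, fitness):
    """Reads expected at the end if abundance changed by a factor of 2**fitness.

    Simplification: ignores normalization and differences in sequencing depth.
    """
    return time0_reads * 2.0 ** fitness


# Hypothetical gene with 100 reads at Time0 (an assumed number).
for fitness in (0, -1, -4, -5):
    print(fitness, expected_end_reads(100, fitness))
```

Under these assumed counts, fitness values of -4 and -5 correspond to roughly 6 and 3 reads, a gap comparable to counting noise, whereas 100 versus 50 reads (fitness 0 versus -1) is easy to tell apart.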

More rigorously, gene fitness is the weighted average of strain fitness, across strains that have a transposon inserted within that gene. A strain's fitness is the log2 ratio of abundance at the end of the experiment compared to its abundance at the beginning of the experiment, where we use the number of reads for each strain's barcode as a proxy for its abundance. The gene fitness is normalized so that the typical gene has a fitness of zero. For genes on large chromosomes, the gene fitness values are also normalized for changes in copy number along the chromosome.
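The calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline's actual code. The pseudocount, the choice of Time0 reads as the weight, and the use of the median as the "typical gene" are all assumptions made for the example. The copy-number normalization is not shown.

```python
import math
import statistics


def strain_fitness(end_reads, time0_reads, pseudocount=1.0):
    """log2 ratio of end reads to Time0 reads for one barcode.

    The pseudocount (an assumption) avoids taking the log of zero.
    """
    return math.log2((end_reads + pseudocount) / (time0_reads + pseudocount))


def gene_fitness(strains, pseudocount=1.0):
    """Weighted average of strain fitness for one gene.

    strains: list of (end_reads, time0_reads), one pair per insertion in the gene.
    Weighting each strain by its Time0 reads is an assumption for this sketch.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for end_reads, time0_reads in strains:
        weight = time0_reads
        weighted_sum += weight * strain_fitness(end_reads, time0_reads, pseudocount)
        total_weight += weight
    return weighted_sum / total_weight


def normalize(fitness_by_gene):
    """Shift values so the median gene has fitness zero (median is an assumption)."""
    center = statistics.median(fitness_by_gene.values())
    return {gene: fit - center for gene, fit in fitness_by_gene.items()}
```

For example, `gene_fitness([(50, 100), (25, 100)], pseudocount=0.0)` averages strain fitness values of -1 and -2 with equal weights and returns -1.5.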

Although most experiments are based on growth, this site also includes assays of motility or survival. For a motility assay, the experimental samples might be the cells that reached the outer ring of an agar plate, or that stayed in the inner ring where the cells were originally placed. For a survival assay, the cells are stressed or starved for a period of time; then, to distinguish viable cells from dead cells, all cells are transferred to a rich medium and recovered for a few generations.

t scores

The t-like test statistic indicates how reliably a gene fitness value is different from zero. Ideally, these statistics are on the same scale as z scores or t scores. However, since there are thousands of genes in each experiment, and there can be hundreds of fitness experiments for a gene, a higher threshold is needed. We usually ignore any fitness effects with |t| < 4. In most cases, you can gain confidence in a fitness effect by comparing the phenotype of a gene in replicate experiments, or in similar experiments (such as different concentrations of the same inhibitory compound), or for orthologous genes.

Cofitness

Cofitness(gene 1, gene 2) is the linear (Pearson) correlation of the two genes' fitness patterns across experiments. If two genes in the same organism have similar fitness patterns, then we say that they are cofit.

If two genes have similar fitness patterns (cofitness > 0.75), and they are among the most cofit genes (rank = 1 or rank = 2), then they are likely to function in the same pathway. For genes with strong fitness patterns, often the most cofit genes are other genes in the same operon, so we look a little farther down the list to find genes that may have related functions.

Conserved cofitness: If two genes have cofitness > 0.6, and their orthologs have cofitness > 0.6, then this is stronger evidence of a functional relationship.
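The cofitness definitions above can be sketched as follows. This is an illustrative example in plain Python, not the site's code. The function names and the dictionary layout (gene name mapped to a list of fitness values, one per experiment, in the same order for every gene) are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of fitness values.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_cofit(query, fitness_by_gene, n=2):
    """Return the n most cofit genes for `query` as (gene, cofitness), best first."""
    scores = [
        (pearson(fitness_by_gene[query], values), gene)
        for gene, values in fitness_by_gene.items()
        if gene != query
    ]
    scores.sort(reverse=True)
    return [(gene, r) for r, gene in scores[:n]]


def is_conserved_cofit(cofit, ortholog_cofit, threshold=0.6):
    """True if both the gene pair and the ortholog pair exceed the threshold."""
    return cofit > threshold and ortholog_cofit > threshold


# Hypothetical fitness patterns across four experiments.
example = {
    "geneA": [0.0, -2.0, -4.0, 0.0],
    "geneB": [0.1, -1.9, -4.2, 0.0],
    "geneC": [0.0, 1.0, 0.0, -1.0],
}
print(top_cofit("geneA", example, n=1))
```

In this made-up data set, geneB is the top cofit gene (rank 1) for geneA, while geneC is negatively correlated with it.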

If we have relatively little data for an organism, then cofitness results will not be available for any of its genes.

Specific phenotypes

We define a gene as having a "specific" phenotype in a condition if the gene has a stronger phenotype in this condition than in most other conditions, and lacks phenotypes in most conditions. More precisely, we require
  • |fit| > 1
  • |t| > 5
  • |fit|95 < 1, where |fit|95 is the 95th percentile of |fit| across all experiments for this gene
  • |fit| > |fit|95 + 0.5
If we have relatively little data for an organism, then there may not be any specific phenotypes for any of its experiments. Also, these criteria are stringent and may miss some genes.
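The four criteria above translate directly into a small check. The sketch below is an illustration only. The nearest-rank percentile is an assumption, since the exact percentile method is not specified here, and the function names are hypothetical.

```python
def percentile_nearest_rank(values, q):
    """q-th percentile (q an integer from 1 to 100) by the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, converted to a zero-based index.
    k = max(0, (q * len(ordered) + 99) // 100 - 1)
    return ordered[k]


def is_specific(fit, t, all_fits_for_gene):
    """Apply the four criteria for a specific phenotype.

    fit, t: fitness and t score of the gene in the condition of interest.
    all_fits_for_gene: the gene's fitness values across all experiments.
    """
    abs_fit_95 = percentile_nearest_rank([abs(f) for f in all_fits_for_gene], 95)
    return (
        abs(fit) > 1
        and abs(t) > 5
        and abs_fit_95 < 1
        and abs(fit) > abs_fit_95 + 0.5
    )


# Hypothetical gene: near-zero fitness in 39 experiments, a strong defect in one.
fits = [0.1] * 39 + [-3.0]
print(is_specific(-3.0, -8.0, fits))
```

A gene with strong phenotypes in most experiments fails the third criterion, which is why genes with broad fitness defects are not reported as specific.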

Orthologs

We use "orthologs" to refer to similar proteins in different organisms that may carry out the same function, without regard to their evolutionary history. Thus they are putative functional orthologs, not evolutionary orthologs. The "orthologs" in this web site are bidirectional best hits from protein BLAST. We also require that the BLAST alignment cover 80% of each protein.
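A bidirectional best hit search with a coverage requirement can be sketched as below. This is an illustration, not the code behind this site. The record layout (dictionaries with `query`, `subject`, `bits`, `query_cov`, and `subject_cov` keys) is hypothetical, and applying the coverage filter after the reciprocal check is an assumption about the order of the steps.

```python
def best_hit(hits):
    """Map each query to its hit record with the highest bit score."""
    best = {}
    for hit in hits:
        query = hit["query"]
        if query not in best or hit["bits"] > best[query]["bits"]:
            best[query] = hit
    return best


def bidirectional_best_hits(a_vs_b, b_vs_a, min_coverage=0.8):
    """Return sorted (a_protein, b_protein) pairs that are each other's best hit.

    The forward alignment must also cover at least min_coverage of both proteins.
    """
    forward = best_hit(a_vs_b)
    reverse = best_hit(b_vs_a)
    pairs = []
    for query, hit in forward.items():
        subject = hit["subject"]
        back = reverse.get(subject)
        if back is None or back["subject"] != query:
            continue  # not reciprocal
        if hit["query_cov"] >= min_coverage and hit["subject_cov"] >= min_coverage:
            pairs.append((query, subject))
    return sorted(pairs)


# Hypothetical BLAST results between organism A and organism B.
a_vs_b = [
    {"query": "a1", "subject": "b1", "bits": 200, "query_cov": 0.95, "subject_cov": 0.90},
    {"query": "a1", "subject": "b2", "bits": 80, "query_cov": 0.85, "subject_cov": 0.85},
    {"query": "a2", "subject": "b1", "bits": 150, "query_cov": 0.90, "subject_cov": 0.90},
    {"query": "a3", "subject": "b3", "bits": 120, "query_cov": 0.50, "subject_cov": 0.90},
]
b_vs_a = [
    {"query": "b1", "subject": "a1", "bits": 200, "query_cov": 0.90, "subject_cov": 0.95},
    {"query": "b3", "subject": "a3", "bits": 120, "query_cov": 0.90, "subject_cov": 0.50},
]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this made-up example only a1 and b1 are reported: a2 is not b1's best hit, and the a3/b3 alignment covers only half of a3.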

Many of these "orthologs" actually have different functions. If either gene has a strong fitness pattern, you may be able to use conserved phenotypes or conserved cofitness to confirm that the genes have conserved functions and are truly functional orthologs.

Protein sequence analysis

For each protein, the Fitness Browser includes:
  • PFam domains, computed with HMMer3
  • TIGRFam domains or families, computed with HMMer3
  • The best hit to KEGG, computed with RAPSearch2 and minimum 80% coverage and 30% identity
  • The best hit to Swiss-Prot (the curated part of UniProt), computed with RAPSearch2 and minimum 80% coverage and 30% identity
  • The best hit to annotated enzymes in MetaCyc, computed with RAPSearch2 and minimum 80% coverage and 30% identity
  • The SEED annotation, computed with the SEED API

Information from TIGRFam, KEGG, and SEED is used to link proteins to enzyme commission (EC) numbers and hence to metabolic maps (from the last public release of KEGG).

The Fitness Browser includes links to other analysis tools (see the protein page) as well as a homologs page (computed using BLAST).

Linking to the Fitness Browser

Most of the genomes in this web site were taken from NCBI (i.e., gene identifiers are locus tags and scaffolds are Genbank accessions) or from MicrobesOnline (i.e., gene identifiers and scaffold identifiers are numbers). These identifiers should be stable over time, so URLs from the web site should continue to work in the long run. For example, to link to the fitness data for endA from E. coli, you can use
http://fit.genomics.lbl.gov/cgi-bin/singleFit.cgi?orgId=Keio&locusId=17024&showAll=0
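If you generate such links programmatically, a minimal sketch looks like this. It uses only the URL pattern and parameter names (`orgId`, `locusId`, `showAll`) visible in the example link above. The function name `gene_fitness_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://fit.genomics.lbl.gov/cgi-bin/singleFit.cgi"


def gene_fitness_url(org_id, locus_id, show_all=0):
    """Build a link to the fitness data for one gene."""
    query = urlencode({"orgId": org_id, "locusId": locus_id, "showAll": show_all})
    return f"{BASE_URL}?{query}"


# Reproduces the example link for endA from E. coli.
print(gene_fitness_url("Keio", 17024))
```

Using `urlencode` rather than string concatenation keeps the link valid if an identifier contains characters that need escaping.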

Or, you can use Fitness BLAST to link from any protein sequence to the homologs that have fitness data. You can incorporate this into your web page with just a few lines of code.

Or, you can use Fitness BLAST for genomes to identify orthologs in our data set for an entire genome at once. It takes less than a minute and we plan to store the results indefinitely.

(Both Fitness BLAST and Fitness BLAST for genomes are powered by usearch, not BLAST. However, single sequence search and the homologs page rely on BLAST.)

About the code

The code for this web site is freely available at bitbucket.org. The code was written by Morgan Price, Victoria Lo, and Wenjun Shao in the Arkin lab.

References

  • Wetmore et al 2015 -- carbon source experiments for Escherichia coli BW25113, Shewanella oneidensis MR-1, Shewanella amazonensis SB2B, Phaeobacter inhibens BS107, and Pseudomonas stutzeri RCH2
  • Rubin et al 2015 -- the mutant library for Synechococcus elongatus PCC 7942

Most of the data is not published. Contact Adam Deutschbauer for more information about the unpublished data.

Funding

This site was developed by ENIGMA - Ecosystems and Networks Integrated with Genes and Molecular Assemblies, a Scientific Focus Area Program at Lawrence Berkeley National Laboratory, and supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231.