Help for the Fitness Browser
The Fitness Browser was developed by the Arkin lab. It displays thousands of genome-wide fitness assays from the Deutschbauer lab, the Arkin lab, and collaborators.
How it works
The fitness data is collected using randomly barcoded transposons (RB-TnSeq). Each fitness experiment is based on a pool of 30,000 to 500,000 mutant strains. Every mutant strain has a transposon inserted at a random location in the genome, and each transposon includes a random barcode that allows us to track the abundance of that strain by using PCR followed by DNA sequencing ("BarSeq"). To link the barcode to the location in the genome, we use a more complicated TnSeq-like protocol.
For each fitness experiment, we compare the abundance of each strain at the end of the experiment to its abundance at the beginning. The beginning sample is also referred to as the "Time0" sample. Typically, we recover the pool of mutants from the freezer in rich media, wash the cells and take Time0 sample(s), and transfer the washed cells into many different tubes or wells. Thus, many different conditions may be compared to the same Time0 sample(s).
For details, see our methods paper (Wetmore et al., mBio 2015).
Gene fitness
Fitness values are log2 ratios that describe the change in abundance of mutants in that gene during the experiment. For most of the fitness experiments, which are growth experiments, the change reflects how well the mutants grow. Fitness = 0 means that mutants in this gene grew as well as other mutants and probably about as well as wild type strains. Fitness < 0 means that the gene was important for fitness and the mutants were less abundant at the end of the experiment than at the beginning. For example, fitness = -1 means that mutants in the gene were half as abundant at the end of the experiment, compared to the beginning. Fitness > 0 means that the gene was detrimental to fitness and that mutants had a growth advantage.
In general, if -1 < fitness < 1, then the gene has a subtle phenotype that might be statistically significant (see t scores) but will probably be difficult to interpret. Fitness < -2 or fitness > 2 are strong fitness effects. In the typical experiment, the pool of mutants doubles 4-8 times, so in principle, a conditionally essential gene should have fitness of -4 to -8. However, it is not possible to tell the difference between little growth and no growth with a pooled assay. (Also, very low fitness values are noisier because they are based on a log2 ratio with a small numerator: in the typical experiment, a fitness value of -1 is reliably different from 0, but -5 is not reliably different from -4.)
More rigorously, gene fitness is the weighted average of strain fitness, across strains that have a transposon inserted within that gene. A strain's fitness is the log2 ratio of abundance at the end of the experiment compared to its abundance at the beginning of the experiment, where we use the number of reads for each strain's barcode as a proxy for its abundance. The gene fitness is normalized so that the typical gene has a fitness of zero. For genes on large chromosomes, the gene fitness values are also normalized for changes in copy number along the chromosome. For details see here (especially in "BarSeq data analysis and calculation of gene fitness" and figure S4).
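As a rough illustration of the definitions above, the following Python sketch computes strain fitness as a log2 ratio of barcode read counts and gene fitness as a weighted average over strains. It is not the Fitness Browser's actual code: the pseudocount, the read-count weights, and the median-based normalization are all assumptions made for the example, and the copy-number normalization along the chromosome is omitted.

```python
import math
import statistics

def strain_fitness(start_reads, end_reads, pseudocount=1.0):
    """log2 ratio of end vs. start barcode reads.

    The pseudocount is an assumption of this sketch; it avoids log2(0)."""
    return math.log2((end_reads + pseudocount) / (start_reads + pseudocount))

def gene_fitness(strains):
    """Weighted average of strain fitness for one gene.

    `strains` is a list of (start_reads, end_reads) pairs. Weighting each
    strain by its total read count is a hypothetical choice."""
    weighted_sum = 0.0
    total_weight = 0.0
    for start_reads, end_reads in strains:
        weight = start_reads + end_reads
        weighted_sum += weight * strain_fitness(start_reads, end_reads)
        total_weight += weight
    return weighted_sum / total_weight

def normalize(gene_values):
    """Shift values so the typical (here: median) gene has fitness zero."""
    typical = statistics.median(gene_values.values())
    return {gene: value - typical for gene, value in gene_values.items()}
```

For example, a strain that goes from 99 reads to 49 reads has a fitness of -1 in this sketch, which matches the "half as abundant" reading of fitness = -1.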
Although most experiments are based on growth, this site also includes assays of motility or survival. For a motility assay, the experimental samples might be the cells that reached the outer ring of an agar plate, or that stayed in the inner ring where the cells were originally placed. For a survival assay, the cells are stressed or starved for a period of time; then, to distinguish viable cells from dead cells, all cells are transferred to a rich medium and recovered for a few generations.
t scores
The t-like test statistic indicates how reliably a gene fitness value is different from zero. Ideally, these scores are on the same scale as z scores or t scores. However, since there are thousands of genes in each experiment, and there can be hundreds of fitness experiments for a gene, a higher threshold is needed. We usually ignore any fitness effects with |t| < 4. In most cases, you can gain confidence in a fitness effect by comparing the phenotype of a gene in replicate experiments, or in similar experiments (such as different concentrations of the same inhibitory compound), or for orthologous genes. The t-like scores are described here (especially in "t-like test statistic" and figure S5).
Cofitness
Cofitness(gene 1, gene 2) is the linear (Pearson) correlation of their fitness patterns. If two genes in the same organism have similar fitness patterns, then we say that they are cofit.
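The Pearson correlation behind cofitness can be sketched in a few lines of Python. The function name and the list-based inputs are hypothetical; each list is assumed to hold one gene's fitness values across the same experiments, in the same order.

```python
import math

def cofitness(fit1, fit2):
    """Pearson correlation of two genes' fitness patterns.

    fit1 and fit2 hold fitness values for the same experiments, in order."""
    n = len(fit1)
    if n != len(fit2) or n < 2:
        raise ValueError("need two equal-length patterns with at least 2 experiments")
    mean1 = sum(fit1) / n
    mean2 = sum(fit2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(fit1, fit2))
    var1 = sum((a - mean1) ** 2 for a in fit1)
    var2 = sum((b - mean2) ** 2 for b in fit2)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return cov / math.sqrt(var1 * var2)
```

Two genes whose fitness values rise and fall together across experiments give a value near 1, while opposite patterns give a value near -1.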
If two genes have similar fitness patterns (cofitness > 0.75), and they are among the most cofit genes (rank = 1 or rank = 2), then they are likely to function in the same pathway. For genes with strong fitness patterns, often the most cofit genes are other genes in the same operon, so we look a little farther down the list to find genes that may have related functions.
Conserved cofitness: If two genes have cofitness > 0.6, and their orthologs have cofitness > 0.6, then this is also evidence of a functional relationship.
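The two rules of thumb above (cofitness > 0.75 with rank 1 or 2, and conserved cofitness > 0.6 in both organisms) can be written as simple predicates. This is only a sketch; the function and parameter names are hypothetical, and the thresholds are the ones quoted in the text.

```python
def likely_same_pathway(cofit, rank, min_cofit=0.75, max_rank=2):
    """True if the pair is highly cofit and among the top-ranked cofit genes."""
    return cofit > min_cofit and 1 <= rank <= max_rank

def has_conserved_cofitness(cofit, ortholog_cofit, min_cofit=0.6):
    """True if both the gene pair and its ortholog pair exceed the threshold."""
    return cofit > min_cofit and ortholog_cofit > min_cofit
```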
If we have relatively little data for an organism, then cofitness results will not be available for any of its genes.
Specific phenotypes
We define a gene as having a "specific" phenotype in a condition if the gene has a stronger phenotype in this condition than in most other conditions, and lacks phenotypes in most conditions. More precisely, we require:
- |fit| > 1
- |t| > 5
- |fit|95 < 1, where |fit|95 is the 95th percentile of |fit| across all experiments for this gene
- |fit| > |fit|95 + 0.5
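The four criteria above can be checked together as in the following Python sketch. The function names are hypothetical, and the percentile uses linear interpolation between sorted values, which is an assumption; the site may compute the 95th percentile differently.

```python
def percentile(values, q):
    """Percentile with linear interpolation (q between 0 and 100).

    The interpolation rule is an assumption of this sketch."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def is_specific(fit, t, all_fit_for_gene):
    """Apply the four 'specific phenotype' criteria to one gene in one experiment.

    all_fit_for_gene holds the gene's fitness in every experiment."""
    abs_fit = abs(fit)
    fit95 = percentile([abs(f) for f in all_fit_for_gene], 95)
    return (abs_fit > 1
            and abs(t) > 5
            and fit95 < 1
            and abs_fit > fit95 + 0.5)
```

A gene with fitness near zero in 39 experiments and fitness = -3 with t = -8 in one experiment passes all four tests for that experiment; a gene with strong phenotypes in half of its experiments fails the |fit|95 < 1 test.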
Orthologs
We use "orthologs" to refer to proteins in different organisms that have similar sequences and may carry out the same function, without regard to their evolutionary history. Thus they are putative functional orthologs, not evolutionary orthologs. The "orthologs" in this web site are bidirectional best hits from protein BLAST. We also require that the BLAST alignment cover at least 80% of each protein.
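The bidirectional-best-hit idea can be sketched as follows. The record layout (dictionaries with `query`, `subject`, `bits`, `qcov`, and `scov` keys) is hypothetical and is not a BLAST output format, and applying the coverage filter before choosing the best hit is an assumption of this sketch.

```python
def best_hits(hits, min_coverage=0.8):
    """Map each query to its highest-scoring subject among hits that pass coverage."""
    best = {}
    for hit in hits:
        # Require the alignment to cover enough of both proteins.
        if hit["qcov"] < min_coverage or hit["scov"] < min_coverage:
            continue
        current = best.get(hit["query"])
        if current is None or hit["bits"] > current["bits"]:
            best[hit["query"]] = hit
    return {query: hit["subject"] for query, hit in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a, min_coverage=0.8):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(a_vs_b, min_coverage)
    b_to_a = best_hits(b_vs_a, min_coverage)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)
```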
Many of these "orthologs" actually have different functions. If either gene has a strong fitness pattern, you may be able to use conserved phenotypes or conserved cofitness to confirm that the genes have conserved functions and are truly functional orthologs.
Protein sequence analysis
For each protein, the Fitness Browser includes:
- PFam domains, computed with HMMer3
- TIGRFam domains or families, computed with HMMer3
- The best hit to KEGG, computed with RAPSearch2 and minimum 80% coverage and 30% identity
- The best hit to Swiss-Prot (the curated part of UniProt), computed with RAPSearch2 and minimum 80% coverage and 30% identity
- The best hit to annotated enzymes in MetaCyc, computed with RAPSearch2 and minimum 80% coverage and 30% identity
- The SEED annotation, computed with the SEED API
Information from TIGRFam, KEGG, and SEED is used to link proteins to enzyme commission (EC) numbers and hence to metabolic maps (from the last public release of KEGG). The EC numbers are also used to link proteins to MetaCyc pathways, along with best hits to MetaCyc.
The Fitness Browser includes links to other analysis tools (see the protein tab) as well as a homologs tab that shows the top homologs that we have fitness data for (as found by protein BLAST). For advice on how to use all these tools together, see Interactive tools for functional annotation of bacterial genomes.
Growth
For most of the experiments, optical density was used to estimate the number of cells at the start of the experiment and at the end. The optical density at the start of the experiment is usually measured indirectly from the OD of a more concentrated sample before inoculation. The log2 ratio of the end OD versus the start OD gives an estimate of the number of generations of growth during the fitness assay. The number of generations might be underestimated for experiments that reached a high density. For example, an experiment with 6 generations might be estimated as 5 generations. In this paper, we described how we calibrated some of the OD measurements, but those calibrations are not reflected in the Fitness Browser.
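The generation estimate described above is a one-line calculation. In the sketch below, the function names are hypothetical, and inferring the start OD by dividing the OD of the concentrated sample by a dilution factor is an assumption about how the indirect measurement is made.

```python
import math

def start_od_from_stock(stock_od, dilution_factor):
    """Infer the start OD from a more concentrated sample (assumed simple dilution)."""
    return stock_od / dilution_factor

def generations(start_od, end_od):
    """Estimated number of doublings: log2 of end OD over start OD."""
    return math.log2(end_od / start_od)
```

For instance, a stock at OD 1.0 diluted 50-fold gives an estimated start OD of 0.02, and growth to OD 0.64 then corresponds to about 5 generations.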
For experiments that do not have a measurement of the final optical density, you may be able to use the scale of the fitness values to get a rough estimate of the number of generations. In principle, genes that are absolutely required for growth should have fitness -n, where n is the number of generations. This is more likely to work for defined media experiments, where many mutants that are present in the library have strong fitness defects.
Linking to the Fitness Browser
Most of the genomes in this web site were taken from NCBI (i.e., gene identifiers are locus tags and scaffolds are Genbank accessions) or from MicrobesOnline (i.e., gene identifiers and scaffold identifiers are numbers). These identifiers should be stable over time, so URLs from the web site should continue to work in the long run. For example, to link to the fitness data for endA from E. coli, you can use
http://fit.genomics.lbl.gov/cgi-bin/singleFit.cgi?orgId=Keio&locusId=17024&showAll=0
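If you generate many such links, a small helper can build them. The sketch below reuses only the script name and the query parameters (orgId, locusId, showAll) that appear in the example URL above; the function name is hypothetical, and other organisms' identifiers would need to be looked up on the site.

```python
from urllib.parse import urlencode

BASE_URL = "http://fit.genomics.lbl.gov/cgi-bin/singleFit.cgi"

def single_fit_url(org_id, locus_id, show_all=0):
    """Build a link to the fitness data for one gene."""
    query = urlencode({"orgId": org_id, "locusId": locus_id, "showAll": show_all})
    return BASE_URL + "?" + query
```

Calling `single_fit_url("Keio", 17024)` reproduces the endA link shown above.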
Or, you can use Fitness BLAST to link from any protein sequence to the homologs that have fitness data. You can incorporate this into your web page with just a few lines of code.
(Fitness BLAST is powered by usearch, not BLAST. However, single sequence search and the homologs page rely on protein BLAST.)
Data Downloads
- You can download tables for each organism from the links at the bottom of each organism's page.
- For the current version of the Fitness Browser, you can download all protein sequences or the main sqlite3 database (large! ~5 GB as of July 2017).
- The June 2017 release of the Fitness Browser is available in its entirety here.
- You can download all of the reannotations here (tab-delimited, and includes protein sequences).
About the code
The code for this web site is freely available at bitbucket.org. The code was written by Morgan Price, Victoria Lo, and Wenjun Shao in the Arkin lab.
Funding
This site was developed by ENIGMA - Ecosystems and Networks Integrated with Genes and Molecular Assemblies, a Scientific Focus Area Program at Lawrence Berkeley National Laboratory, and supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231.