Charles Blatti (firstname.lastname@example.org), Laura Sloofman, and Saurabh Sinha
For this analysis, 223 representative transcription factor (TF) binding motifs were selected from the FlyFactorSurvey collection of motifs characterized in Drosophila with a bacterial one-hybrid system followed by sequencing. Representative transcription factors were chosen from clusters defined by motif similarity, and each was required to have at least one bee ortholog in the orthogroups defined by OrthoDB.
For our cis-regulatory analysis, we first produce two different normalized genome-wide scoring profiles for each selected TF motif in each of the bee genomes. The first step of this process is to mask out the tandem repeats in each genome with Tandem Repeats Finder. Next, each genome is divided into 500 bp windows that overlap by 250 bp. The HMM-based motif scoring program Stubb is then run with every selected motif on every genome to produce a motif score for each window. Stubb is run with a fixed transition probability (0.0025) to the motif state and a background-state nucleotide distribution learned from the central 5 kbp of gene deserts (regions of at least 22 kbp without coding features) in the corresponding genome. The first normalization of these genome-wide motif score profiles, "Rank Normalized", rank-normalizes the window scores within each species and for each motif, converting them into values from 0 (best) to 1 (worst). The second normalization procedure, "G/C Normalized", takes each window's G/C content into account, because a motif composed mostly of C’s and G’s is expected to receive a high Stubb score in a G/C-rich window. This procedure separates the genomic windows into 20 equal-sized bins based on their G/C content and performs rank normalization within each bin separately.
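The windowing and the two normalizations described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, higher raw Stubb scores are assumed to be better, a window's normalized score is assumed to be the fraction of windows scoring at least as high (so the best window gets 1/n rather than exactly 0), "equal-sized" bins are assumed to hold equal numbers of windows, and any trailing partial window is simply dropped.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def tile_windows(seq_len: int, size: int = 500, step: int = 250) -> List[Tuple[int, int]]:
    """Half-open (start, end) windows of `size` bp overlapping by size - step bp.

    Assumption: a trailing remainder too short for a full window is dropped.
    """
    last_start = max(seq_len - size, 0)
    return [(s, min(s + size, seq_len)) for s in range(0, last_start + 1, step)]


def rank_normalize(scores: Sequence[float]) -> List[float]:
    """Map raw scores to (0, 1]; small values are good.

    Assumption: each value is the fraction of windows whose raw score is
    greater than or equal to this window's score (ties share a value).
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [(n - bisect_left(ordered, s)) / n for s in scores]


def gc_normalize(scores: Sequence[float], gc: Sequence[float], n_bins: int = 20) -> List[float]:
    """Rank-normalize separately within bins of similar G/C content.

    Assumption: bins contain (nearly) equal numbers of windows.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: gc[i])  # window indices by G/C content
    out = [0.0] * n
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue  # fewer windows than bins
        for i, value in zip(idx, rank_normalize([scores[i] for i in idx])):
            out[i] = value
    return out
```

With binning, a G/C-rich window competes only against windows of similar composition, so a G/C-rich motif no longer ranks well merely because the window itself is G/C rich.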
The final stage of our motif-scoring pipeline is to summarize the motif scores at the gene level. For each gene in each species, we calculate a p-value for each motif. This motif p-value for a gene is calculated as Pgm = 1 − (1 − Ngm)^Wg, where Ngm is the best normalized score for motif m among the Wg windows that fall within the regulatory region of gene g. We define the regulatory region of a gene in five possible ways:
The following files are matrices of gene-level motif binding scores for each of the bee species, using combinations of the two normalization procedures (Rank and G/C) and the five regulatory region definitions. The rows are genes, the columns are motifs, and the scores are the Pgm modified p-values, which range from 0 (best) to 1 (worst). A score of "2" indicates that no motif score is available, most likely because the regulatory region was small and masked by tandem repeats.
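As a small worked sketch of how one entry of such a matrix could be derived from normalized window scores, the snippet below applies Pgm = 1 − (1 − Ngm)^Wg and uses the value 2 when no window is scored. The function name and the `MISSING` constant are hypothetical, and treating an empty window list as the "2" case is an assumption based on the description above.

```python
from typing import Sequence

MISSING = 2.0  # sentinel used in the matrices when no motif score exists


def gene_motif_pvalue(window_scores: Sequence[float]) -> float:
    """Return Pgm = 1 - (1 - Ngm)^Wg for one gene and one motif.

    `window_scores` holds the normalized scores (0 = best, 1 = worst) of the
    Wg windows in the gene's regulatory region.
    """
    w_g = len(window_scores)
    if w_g == 0:
        return MISSING  # assumption: no unmasked window in the regulatory region
    n_gm = min(window_scores)  # best (smallest) normalized score
    return 1.0 - (1.0 - n_gm) ** w_g


# A best score of 0.5 among two windows gives 1 - 0.5**2 = 0.75.
print(gene_motif_pvalue([0.5, 0.9]))
```

The exponent corrects for region size: if the windows were independent and their normalized scores uniform on [0, 1], Pgm would be the chance that at least one of the Wg windows scores as well as Ngm. Overlapping windows are not strictly independent, so this is best read as an approximation.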
Kapheim KM, et al. Social evolution: Genomic signatures of evolutionary transitions from solitary to group living. Science. 2015. (PubMed)