Four directories hold the data used in this work.

Directory names: Resource, BasicDatasets, SynoTable, Tfile.

Throughout the datasets, the synonymous codons under investigation appear in a fixed order within their synonymous codon families, as listed in the table below:
'synonymous codon family table':
Glu ={'GAG','GAA'}; 
His ={'CAT','CAC'};
Gln ={'CAG','CAA'};
Phe ={'TTT','TTC'};
Tyr ={'TAT','TAC'}; 
Cys ={'TGT','TGC'};
Asn ={'AAT','AAC'}; 
Lys ={'AAG','AAA'}; 
Asp ={'GAT','GAC'};
Ile ={'ATA','ATT','ATC'}; 
Pro ={'CCG','CCA','CCT','CCC'}; 
Thr ={'ACG','ACA','ACT','ACC'};  
Ala ={'GCG','GCA','GCT','GCC'}; 
Val ={'GTG','GTA','GTT','GTC'}; 
Gly ={'GGG','GGA','GGT','GGC'};
Leu ={'TTG','TTA','CTG','CTA','CTT','CTC'};  
Ser ={'AGT','AGC','TCG','TCA','TCT','TCC'};   
Arg ={'AGG','AGA','CGG','CGA','CGT','CGC'};   

Amino acid abbreviations are as follows:
Glu:E; His:H; Gln:Q; Phe:F; Tyr:Y; Cys:C; Asn:N; Lys:K; Asp:D; Ile:I; Pro:P; Thr:T; Ala:A; Val:V; Gly:G; Leu:L; Ser:S; Arg:R.
   
Files under each directory:
(1) The directory 'Resource' contains one file, 'speciesList.xlsx', with three worksheets: 'fungi', 'bacteria', and 'protist'. Each worksheet has two columns, and its number of rows equals the number of species investigated in that kingdom. The header is 'species name, weblink'; each row gives a species name and the weblink of the corresponding genome data resource.

(2) In directory 'BasicDatasets', filenames follow the pattern: 'species name' + 'amino acid type' + ratio + 'kingdom type'. Species names are those listed in 'speciesList.xlsx'. Amino acid type is one of: 'GluE', 'HisH', 'GlnQ', 'PheF', 'TyrY', 'CysC', 'AsnN', 'LysK', 'AspD', 'IleI', 'ProP', 'ThrT', 'AlaA', 'ValV', 'GlyG', 'LeuL', 'SerS', 'ArgR'. Kingdom type is one of: Fg (abbreviation for 'fungi'), Pt (abbreviation for 'protist'), Bc (abbreviation for 'bacteria'). The number of rows equals the number of genes in the investigated genome. Each row reports 'gene name', 'codon occurrences in each synonymous codon family', and 'subsequence length'. The header gives the exact nucleotide sequence of each codon.
  

(3) The directory 'SynoTable' contains the files 'synoTableBacteria.txt', 'synoTableFungi.txt', and 'synoTableProtist.txt', which hold the genome wide codon usage tables for the 3 kingdoms. Each file contains 60 columns, and its number of rows equals the number of species in the investigated kingdom. Each row reports 'species name' and 'codon usage ratio in each synonymous codon family'. Within each file, the header 'speciesName,E(GAG,GAA),H(CAT,CAC),Q(CAG,CAA),F(TTT,TTC),Y(TAT,TAC),C(TGT,TGC),N(AAT,AAC),K(AAG,AAA),D(GAT,GAC),I(ATA,ATT,ATC),P(CCG,CCA,CCT,CCC),T(ACG,ACA,ACT,ACC),A(GCG,GCA,GCT,GCC),V(GTG,GTA,GTT,GTC),G(GGG,GGA,GGT,GGC),L(TTG,TTA,CTG,CTA,CTT,CTC),S(AGT,AGC,TCG,TCA,TCT,TCC),R(AGG,AGA,CGG,CGA,CGT,CGC)' lists the synonymous codon families in order.
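
The grouped header can be expanded into one label per column, which also confirms the count of 60 columns (1 species name + 59 codons). This is a minimal sketch in Python that works only on the header string quoted above; the function name expand_header and the 'E:GAG' label style are hypothetical and not part of the dataset.

```python
import re

# Header string as quoted in this README.
HEADER = (
    "speciesName,E(GAG,GAA),H(CAT,CAC),Q(CAG,CAA),F(TTT,TTC),Y(TAT,TAC),"
    "C(TGT,TGC),N(AAT,AAC),K(AAG,AAA),D(GAT,GAC),I(ATA,ATT,ATC),"
    "P(CCG,CCA,CCT,CCC),T(ACG,ACA,ACT,ACC),A(GCG,GCA,GCT,GCC),"
    "V(GTG,GTA,GTT,GTC),G(GGG,GGA,GGT,GGC),L(TTG,TTA,CTG,CTA,CTT,CTC),"
    "S(AGT,AGC,TCG,TCA,TCT,TCC),R(AGG,AGA,CGG,CGA,CGT,CGC)"
)


def expand_header(header):
    """Return ['speciesName', 'E:GAG', 'E:GAA', ...] from the grouped header."""
    labels = ["speciesName"]
    # Each group is a one-letter amino acid code followed by its codons
    # in parentheses, e.g. E(GAG,GAA).
    for aa, codons in re.findall(r"([A-Z])\(([ACGT,]+)\)", header):
        labels.extend(f"{aa}:{codon}" for codon in codons.split(","))
    return labels


labels = expand_header(HEADER)
print(len(labels))  # 60
```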


The usage proportion of each codon within its synonymous codon family is calculated by summing the codon's occurrences over all genes in the genome of interest, using the data in the 'BasicDatasets' directory, and dividing by the total occurrences of all codons in that family. Applying this procedure to every species in each kingdom gives the genome wide codon usage table for each of the three kingdoms.
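
The calculation described above can be sketched as follows. This is an illustrative Python example, not the code used to build the tables: it assumes the per-gene counts for one family have already been read into lists of integers in the family's display order, the function name family_usage_ratios is hypothetical, and returning zeros for a family that never occurs is an assumption.

```python
def family_usage_ratios(gene_counts):
    """Sum per-gene codon counts for one family and normalise to proportions.

    gene_counts: one list per gene, each holding the occurrence count of
    every codon in the family, in the family's display order.
    """
    # Column-wise sum: total occurrences of each codon over all genes.
    totals = [sum(column) for column in zip(*gene_counts)]
    grand_total = sum(totals)
    if grand_total == 0:
        # Assumption: a family absent from the whole genome gets zero ratios.
        return [0.0] * len(totals)
    return [t / grand_total for t in totals]


# Hypothetical counts for Glu (GAG, GAA) in three genes.
glu_counts = [[3, 1], [0, 2], [5, 5]]
print(family_usage_ratios(glu_counts))  # [0.5, 0.5]
```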

(4) In directory 'Tfile', filenames follow the pattern: 'species name' + For + 'kingdom type' + 'group type'. Species names are those listed in 'speciesList.xlsx'. Kingdom type is one of: Fg (abbreviation for 'fungi'), Pt (abbreviation for 'protist'), Bc (abbreviation for 'bacteria'). Group type is one of: Tb (observed genome), Tab (control group genome). Each file contains 7 columns. The number of rows equals 18 times the number of genes, minus, for each gene, the number of amino acids that do not occur in it. Each row reports 'amino acid', 'subsequence length', the probability of the investigated codon usage configuration, the maximum probability of a codon usage configuration of that length, 'geneID' (the position of the investigated gene in the genome file), and the occurrence of the first codon type in its codon family (order as in the 'synonymous codon family table' above). These correspond to the header 'aa,L,Pi,Pmax,GeneId,codon1'.

Suppose an underlying codon position access propensity P(AA)=[P1,P2,...,Pm] is given, where m is the size of the synonymous codon family. In our case P(AA) is set to the synonymous codon ratios in the genome wide codon usage table. Given P(AA), in a subsequence of length L, a particular configuration of codon occurrences C(AA)=[C1,C2,...,Cm], with C1+C2+...+Cm=L, has the multinomial probability Pi = (L!/(C1!C2!...Cm!))(P1^C1)(P2^C2)...(Pm^Cm). The maximum probability Pmax for subsequence length L is the largest value among all possible Pi values.
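
To make the two quantities concrete, here is a minimal Python sketch of Pi and Pmax. The function names are hypothetical. Pmax is found by brute-force enumeration of every configuration, which is practical only for short subsequences and is not necessarily how the values in 'Tfile' were computed.

```python
from math import factorial, prod


def multinomial_probability(counts, probs):
    """Pi for codon counts [C1..Cm] under propensities [P1..Pm]."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # stays an integer at every step
    return coeff * prod(p ** c for p, c in zip(probs, counts))


def configurations(length, parts):
    """Yield every tuple of `parts` non-negative integers summing to `length`."""
    if parts == 1:
        yield (length,)
        return
    for first in range(length + 1):
        for rest in configurations(length - first, parts - 1):
            yield (first,) + rest


def max_probability(length, probs):
    """Pmax: the largest Pi over all configurations of the given length."""
    return max(
        multinomial_probability(c, probs)
        for c in configurations(length, len(probs))
    )


# Hypothetical two-codon family with equal usage, subsequence length 4.
print(multinomial_probability([3, 1], [0.5, 0.5]))  # 0.25
print(max_probability(4, [0.5, 0.5]))               # 0.375
```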

To better assess the non-randomness of the observed genomes, we generate substituted artificial genomes as control sets. A substituted genome is generated by replacing each codon with a member of the same synonymous codon family, drawn with weights that follow the genome wide codon usage table.
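
The substitution step can be illustrated with a short Python sketch. It is not the code used to produce the control genomes: the function name substitute_codons, the example usage ratios, and the choice to leave codons outside the 18 families (Met, Trp, stop codons) unchanged are all assumptions made for illustration.

```python
import random


def substitute_codons(codons, families, usage, rng):
    """Replace each codon with a synonym drawn by genome wide usage weights.

    families: amino acid -> list of codons in display order.
    usage:    amino acid -> list of usage ratios in the same order.
    """
    family_of = {c: aa for aa, members in families.items() for c in members}
    substituted = []
    for codon in codons:
        aa = family_of.get(codon)
        if aa is None:
            # Assumption: codons outside the listed families are kept as-is.
            substituted.append(codon)
        else:
            choice = rng.choices(families[aa], weights=usage[aa], k=1)[0]
            substituted.append(choice)
    return substituted


# Two families from the table above, with hypothetical usage ratios.
families = {"E": ["GAG", "GAA"], "K": ["AAG", "AAA"]}
usage = {"E": [0.7, 0.3], "K": [0.4, 0.6]}
gene = ["ATG", "GAA", "AAG", "GAG", "TAA"]

rng = random.Random(0)  # seeded so the sketch is reproducible
control = substitute_codons(gene, families, usage, rng)
print(control)
```

Each substituted codon stays within its original family, so the encoded amino acid sequence is unchanged.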

The files in this directory are derived from the data in directories 'BasicDatasets' and 'SynoTable'.