A desk study on the discovery of gamma-proteobacteria and their contribution to nitrogen fixation was conducted as an initial step of the research project. This group was first recovered in the South China Sea by Moisander et al. (2008). After gaining a basic understanding of heterotrophic bacteria, molecular analysis of nifH gene sequences was conducted on putative active heterotrophic diazotrophs collected from different studies that exclusively used 454 pyrosequencing or Illumina sequencing of DNA samples. The database for this study was built from studies covering the global marine surface (Farnelid et al., 2011), the central Baltic Sea (Farnelid et al., 2013), the Arafura and Timor Seas (Messer et al., 2016), the Eastern tropical North Pacific Ocean (Cheung et al., 2016), and the North Pacific Ocean (Shizaki et al., 2017). In total, there are 94 DNA samples and 45 RNA samples. Sequence data (both DNA and RNA) of nifH genes were downloaded as FASTA-formatted sequence files, using their accession numbers, from the National Center for Biotechnology Information (NCBI) Sequence Read Archive.

Sequence quality control was conducted with Mothur. The sequences were clipped at 300 base pairs (bp) and filtered to include only sequences with correct tag and primer sequences (Farnelid et al., 2013). Low-quality sequences (quality score < 25), short sequences (< 300 bp), and homopolymer-containing sequences (homopolymer length > 8 bases) were removed (Cheung et al., 2016). After alignment against the nifH reference database, poorly aligned sequences, sequences containing ambiguous bases, and chimeric sequences were removed. High-quality sequences were then de-noised with a sigma value of 0.01 to minimize the negative impact of PCR bias. Because the sequences were scattered across multiple files from different studies, they were then merged and assembled into a new group. Sequences produced by pyrosequencing errors were removed, which also speeds up the calculation of phylogenetic distances between aligned sequences. The sequences were clustered into operational taxonomic units (OTUs) at a cutoff value of 0.20, so that sequences within an OTU share at least 80% sequence identity. Since the high sequence similarity within the top OTUs indicates the importance of conserved genes, it is necessary to study the dominant OTUs at the expression level. Representative sequences were translated to amino acid (aa) sequences using the FRAMEBOT online pipeline (Wang et al., 2013).
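The read-level filters described above (clipping at 300 bp, minimum length, quality score, homopolymer length, ambiguous bases) can be illustrated with a minimal Python sketch. This is not the Mothur implementation used in the study. The function names are hypothetical, and treating the quality threshold as a mean per-read score is an assumption, since the text does not state how the score of 25 was applied.

```python
import re

CLIP_LENGTH = 300        # reads are clipped at 300 bp
MIN_LENGTH = 300         # reads shorter than 300 bp are removed
MIN_MEAN_QUALITY = 25    # assumption: threshold applied to the mean quality score
MAX_HOMOPOLYMER = 8      # reads with a homopolymer run longer than 8 bases are removed


def longest_homopolymer(seq):
    """Return the length of the longest run of one repeated base."""
    longest = 0
    for match in re.finditer(r"(.)\1*", seq):
        longest = max(longest, len(match.group(0)))
    return longest


def passes_quality_control(seq, quals):
    """Apply the length, ambiguity, homopolymer and quality filters to one read.

    seq   -- nucleotide sequence as a string
    quals -- per-base quality scores, one integer per base in seq
    """
    seq = seq.upper()[:CLIP_LENGTH]
    quals = quals[:CLIP_LENGTH]
    if len(seq) < MIN_LENGTH:
        return False
    if set(seq) - set("ACGT"):          # any ambiguous base, e.g. N
        return False
    if longest_homopolymer(seq) > MAX_HOMOPOLYMER:
        return False
    if sum(quals) / len(quals) < MIN_MEAN_QUALITY:
        return False
    return True
```

A read passes only if all four checks succeed. Tag and primer matching, alignment screening, chimera removal and de-noising are separate steps and are not covered by this sketch.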

The aa sequences were then aligned against the NCBI protein sequence databases using protein BLAST (blastp). Two trial runs were carried out, one including and one excluding uncultured or environmental sample sequences. Hits from uncultured and reference strains were selected based on their query cover and identity values, and those with low identity scores were removed.
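The hit selection described above can be sketched in Python as follows. The `BlastHit` record, the `select_hits` function and both threshold values are hypothetical, since the text does not report the exact cutoffs. The sketch only shows the idea of keeping hits with sufficient query cover and identity.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    query_id: str       # OTU representative sequence
    subject_id: str     # matched database sequence
    query_cover: float  # percent of the query covered by the hit
    identity: float     # percent identity of the hit


def select_hits(hits, min_cover=90.0, min_identity=80.0):
    """Keep hits above both thresholds, best identity first.

    The default thresholds are placeholders, not values from the study.
    """
    kept = [
        h for h in hits
        if h.query_cover >= min_cover and h.identity >= min_identity
    ]
    return sorted(kept, key=lambda h: (h.identity, h.query_cover), reverse=True)
```

For example, with the placeholder defaults, a hit with 99% query cover and 75% identity would be discarded, while one with 95% query cover and 92% identity would be kept.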

The OTU representative sequences and the selected reference sequences from the NCBI protein sequence database were then aligned with ClustalW in MEGA7.0. A neighbor-joining phylogenetic tree was constructed with MEGA7.0 on the basis of the ClustalW alignment.
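Neighbor-joining operates on a matrix of pairwise distances between the aligned sequences. The Python sketch below computes such a matrix using the p-distance (the proportion of differing sites). The choice of p-distance is an assumption made for illustration, as the distance model selected in MEGA7.0 is not stated here, and the function names are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping sequence name to aligned sequence."""
    names = list(aligned)
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m in names
        for n in names
    }
```

The resulting matrix is symmetric with zeros on the diagonal, which is the input form a neighbor-joining implementation expects.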



