Application of Efficient Express Sequence Tags Information for Classification and Functional Study of Simple Sequence Repeats in Cattle Testis Tissue

Genomic markers play an important role in tracing the flow of genetic causality of observable signals in animals and plants. In farm animals, males contribute much more than females to the gene pool of subsequent generations, and the testes are the most important organs of the male reproductive system. This study was conducted to investigate simple sequence repeats (SSRs) within expressed sequence tags (ESTs) in order to classify the genes of Bos taurus testis tissue by their relationship and specificity to reproduction-related domains. A total of 48,549 publicly available EST sequences from cattle testis tissue were downloaded from the GenBank database, of which 10,237 sequences whose libraries were made from testis tissue were extracted as the studied sequences using several search tools and software. Across these selected sequences, 2,039 contigs, 5,097 singletons, and 153 SSRs were detected. The EST-SSRs were subsequently evaluated using GenBank and categorized based on their functions in the biological systems of dairy cattle. Investigation of these motifs showed that the identified EST-SSRs can be classified into 48 types, among which GT in dinucleotides and GCC in trinucleotides had the highest frequency. Annotation and gene ontology analysis revealed a relationship between 54 domains and the observed SSRs. Localization and characterization of such markers can help trace the production of amino acids coded by the identified repeats, as shown in this study.


INTRODUCTION
Biological and structural markers play important roles in the selection processes of animal and plant breeding. These markers can speed up the breeding process and shorten the time needed to achieve breeding goals (Wang et al., 2014). The enormous amount of genomic and gene expression data now available also makes it possible to create a new generation of molecular markers from the newly accessible sequences (Ellis & Burke, 2007). However, the identification and classification of markers in terms of their biological functions is still a challenge in the era of genomic studies (Qian et al., 2018).
Molecular markers based on polymerase chain reaction (PCR), such as simple sequence repeats (SSRs), are among the most common markers in genomic analysis. SSRs are tandemly repeated sequences (about 5-50 repeats in most cases) of one to six (or more) base pairs and are randomly distributed within a genome. These sequences are dispersed in prokaryotic and eukaryotic genomes and can be observed in both coding and non-coding sequences (Riar et al., 2011). Investigating SSRs within expressed sequence tags (ESTs) has some intrinsic benefits. EST-SSRs can be rapidly obtained by electronic sorting and are highly transferable to related taxa. Moreover, they have advantages over other molecular markers, including higher frequency, high reproducibility, uniform dispersion across the genome, high rates of transferability across species and genera, and a multi-allelic nature (Gupta & Varshney, 2000).
SSR markers have been applied in marker-assisted selection (Kaur et al., 2015), molecular mapping (Kirungu et al., 2018), assessment of genetic relationships (Huson et al., 2015), detection of polymorphisms across species (Yan et al., 2017), relating phenotypes to genotypes (Kalyana Babu et al., 2014), and finally as an efficient tool linking evolutionary and population genetic studies. EST sequences are short sub-sequences of cDNA. ESTs containing SSRs located in the coding regions of the genome are usually of major concern in genetic studies because of their involvement in the coding of amino acids and in the functions of organs or tissues (Varshney et al., 2002). They may be applied in physical mapping techniques (Bhattacharjee et al., 2018), determination of gene expression (Ma et al., 2012), and sequence comparisons between normal and cancer tissues (Pu et al., 2013). The use of EST sequences has advantages such as the rapid identification of expressed genes, identification of gene families, phylogenetic analyses, survey of developmentally regulated genes, and examination of strain diversity (Li et al., 2003).
Fortunately, public databases have made ESTs accessible as DNA markers for practical application. Such markers can be more helpful than SSRs from unexpressed chromosome regions (Duan et al., 2013). Thus, they may provide information to associate the phenotypes of complex traits with their genetic basis. The huge number of other genomic markers, such as single nucleotide polymorphisms (SNPs), causes problems in model fitting because the regression models used in association studies become over-parameterized (Ehsani et al., 2016). The relatively small number of highly informative EST-SSRs can help prevent such drawbacks.
The amount and pattern of expression of different genes and transcripts are not homogeneous across organs and tissues. Studies have shown that tissue-specific expression is a common phenomenon in living organisms (Stamatoyannopoulos, 2004; Ehsani et al., 2016). In other words, any given part of the coding regions, including ESTs, may be differently expressed in different organs and tissues. This may help in understanding the effects of a given gene on the performance of such an organ or tissue (Janatova & Pohlreich, 2004).
Fertility traits are of major concern in animal breeding, and improving the fertility rate using genetic tools is an important goal in this era (Muller et al., 2017). Typically, male animals have a higher genetic contribution to the next generations in industrial mating systems because of artificial insemination, and the testes are the most important organs of male animals (Garcia-Ruiz et al., 2016). Studies have shown that many genes expressed in the testis can modulate fertility, survival, metabolic processes, and the immune system (Djureinovic et al., 2014).
Over time, approaches in various branches of biological science have changed, and what matters is how the new approaches are used relative to traditional methods. Functional genomic information, including the study of molecular markers, adapts quickly to changing conditions. While a number of studies indicate that EST-SSRs can be a reliable source for classification and functional studies in both plants and animals (Taheri et al., 2019; Bakhtiarizadeh et al., 2012b), this study shows that they can also be an efficient way of examining a particular tissue such as the testis. In this study, EST-SSRs from cattle testis tissue were analyzed to find the relationship between such sequences and functional domains. We examined their frequency and distribution and categorized them by type and structure, so that such biological markers can be used to identify the genes and biological processes related to testis functionality. The results of this study can promote EST-SSR-based detection tools for other organs associated with the reproductive system in future research, and will be a useful resource for molecular breeding, genetics, and genomics. Moreover, the conserved domains found in cattle testis tissue would be a new resource for identifying useful alleles in transcription factors, regulation of gene expression, spermatogenesis, innate immune response, and other important factors.

Retrieving EST libraries
First, 48,549 EST sequences from cattle testis tissue were downloaded from the EST database of the NCBI website (http://www.ncbi.nlm.nih.gov/dbeST). To focus on the sequences whose libraries were made from testis tissue only, we removed the sequences whose source was a pool of several tissues including testis. This was done by inspecting the "tissue type" field of the downloaded FASTA-format sequences in Notepad. From a given accession number onward, the tissue type changed from testis to a "pooled" set of many tissues. We removed these sequences and designated the remaining 10,237 sequences as testis-specific expressed sequences. After removing vector remnants attached to the sequences, short sequences (<150), and poly A (T) tails using EST-clean software (Tae et al., 2012), 2,039 contigs and 5,097 singletons were extracted using Vector NTI software (Lu & Moriyama, 2004).
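The cleaning step described above (dropping short sequences and trimming poly A (T) tails) can be illustrated with a minimal Python sketch. This is not the EST-clean implementation; the function names and the `min_run` threshold of 10 bases are assumptions made for illustration, and only the 150 cutoff comes from the text (it is treated here as a length in bases).

```python
import re

MIN_LENGTH = 150  # length cutoff stated in the text, assumed to be in bases


def trim_poly_tails(seq, min_run=10):
    """Strip a trailing poly-A run and a leading poly-T run.

    min_run is a hypothetical threshold, not a value from the study.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


def clean_ests(records):
    """records: dict mapping sequence id -> sequence string.

    Returns the trimmed sequences that are still at least MIN_LENGTH long.
    """
    cleaned = {}
    for name, seq in records.items():
        trimmed = trim_poly_tails(seq)
        if len(trimmed) >= MIN_LENGTH:
            cleaned[name] = trimmed
    return cleaned
```

Vector removal and contig assembly are outside the scope of this sketch; in the study they were handled by EST-clean and Vector NTI.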

Microsatellite Identification
The collections of contigs and singletons were loaded into the Perl script MISA (Thiel et al., 2003). SSRs containing motifs of 2 to 6 nucleotides in length were selected. The minimum number of repeats was set to 6 for dinucleotides, 5 for trinucleotides, and 4 for tetra-, penta-, and hexanucleotides. The contigs and singletons collected according to these criteria were used for further gene ontology and functional analysis.
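The repeat thresholds above can be expressed as a short regular-expression search. The following Python sketch is only an illustration of the criteria, not the MISA script itself; the function name and the returned tuple layout are assumptions, and compound or interrupted (imperfect) repeats are not handled.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq):
    """Return perfect SSRs as (motif, repeat_count, start, end) tuples.

    Positions are 0-based and end is exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT", "AAA").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            repeats = len(match.group(1)) // unit_len
            hits.append((motif, repeats, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


example = "TTA" + "GT" * 7 + "AAT" + "GCC" * 5 + "TA"
print(find_ssrs(example))
```

Because the search is leftmost-first, the reported motif is the rotation that starts the run in the sequence (for example GT rather than TG), which is one reason motif tables sometimes list both rotations separately.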
In order to find the functional EST-SSRs, the collected contigs and singletons were submitted to the GenBank non-redundant database using BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi) at an E value of 1.0*10E-10 for maximum similarity. The selected sequences were classified by molecular function, biological process, and cellular component by searching their names or abbreviations in the UniProt database (http://www.uniprot.org/). The chromosome regions of the EST-SSRs were finally mapped using Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/), and the overall view of the mapped genome for the observed EST-SSRs was visualized using MapChart version 2.3 (Voorrips, 2002). Furthermore, the type and frequency of amino acids coded by the resulting functional EST-SSRs were predicted using DnaSP software (http://www.ub.edu/dnasp/).
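As a hedged illustration of the E-value filtering step, the sketch below keeps the best hit per query from BLAST tabular output. It assumes the standard 12-column tabular format produced by BLAST+ with `-outfmt 6`, in which the E value is the 11th column; the study itself used the web interface, so this format, the function name, and the default threshold of 1e-10 are assumptions.

```python
def filter_hits(lines, max_evalue=1e-10):
    """Keep, for each query, the subject with the lowest E value.

    lines: iterable of tab-separated rows in the 12-column tabular layout
    (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore).
    Returns a dict mapping query id -> (subject id, evalue).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

The retained subject identifiers would then be looked up by name in UniProt, as described in the text.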

Visual Classifications
Analysis of the testis-specific EST-SSRs using MISA (Thiel et al., 2003) revealed 153 SSRs within the EST sequences, of which 30.51% (n=43) were in contigs and 69.49% (n=110) were in singletons (Table 1). The length of the SSRs ranged from 12 to 246 base pairs. Notably, some of the EST sequences contained more than one microsatellite, forming imperfect microsatellites (Mudunuri & Nagarajaram, 2007). Of the observed SSRs, 94.11% were perfect and the rest were imperfect EST-SSRs.
On the basis of length, the SSR motifs were classified into two groups: class I included repeats ranging from 10 to 20 nucleotides, and class II included repeats longer than 20 nucleotides. Class I and class II accounted for 76.88% and 23.12% of the total SSRs, respectively (Figure 3).
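The length-based grouping can be written as a small helper. This is a minimal sketch of the rule stated above (class I: 10 to 20 nucleotides; class II: more than 20 nucleotides); the function names are hypothetical and the input lengths in the usage line are made-up values, not data from the study.

```python
from collections import Counter


def ssr_class(length_nt):
    """Return "I" for 10-20 nt, "II" for more than 20 nt, None otherwise."""
    if length_nt > 20:
        return "II"
    if length_nt >= 10:
        return "I"
    return None


def class_shares(lengths):
    """Percentage of SSRs in each class, ignoring lengths below 10 nt."""
    counts = Counter(ssr_class(n) for n in lengths)
    counts.pop(None, None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * counts[label] / total for label in ("I", "II")}


# Illustrative lengths only.
print(class_shares([12, 14, 20, 30]))
```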
The results included 48 types of motifs at different frequencies (Table 2). The highest frequencies belonged to GT with 6.32%, GCC with 5.69%, and TG with 5.06%.

Functional Classifications
Further analysis using BLASTX showed that 54 of the 153 SSRs belonged to domains with biological functions. The classified domains and their related motifs, with regard to the major role of the genes in cattle testis tissue, are presented in Table 3. The resulting domains were mostly categorized into spermatogenesis, energy activity, and regulation of transcription and translation. Many ESTs fell into several categories.
Annotation and gene ontology (GO) analysis for molecular function showed that most EST-SSR sequences were involved in protein and nucleic acid binding (Figure 4A). Translation and transcription had the main role among the biological processes (Figure 4B). The GO assignments for cellular components showed that about one third (32.14%) of the SSR-containing ESTs were related to the nucleus, ~16% to membranes, ~10% to organelles, and the remaining ~41% to the cytoplasm (Figure 4C).
Of the 153 sequences containing SSRs, 140 were assigned to Bos taurus chromosomes using the Map Viewer software (http://www.ncbi.nlm.nih.gov/mapview/). In general, the distribution of the SSR loci indicated that the majority of markers were located on the long chromosomes. Chromosome 7 had the highest frequency of linkage between markers and genes, unlike chromosomes 26 and 28, which had no SSR linked to any gene (Figure 5). Sequences containing motifs with a specific function were analyzed to predict the amino acid sequences, and the results are shown in Figure 6. The most abundant codon was CUG, followed by GCC and CAG, which code for leucine, alanine, and glutamine, respectively.
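The codon-to-amino-acid assignments mentioned above follow the standard genetic code and can be checked with a short Python sketch. This is not the DnaSP procedure used in the study; the function names are hypothetical, and the sketch assumes the input is already in the correct reading frame.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def split_codons(cds):
    """Split an in-frame sequence into codons; RNA input (U) is accepted."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(cds):
    """Count how often each codon occurs in the given reading frame."""
    return Counter(split_codons(cds))


def translate(cds):
    """Translate to one-letter amino acids; unknown codons become X."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(cds))


# CUG, GCC and CAG map to leucine (L), alanine (A) and glutamine (Q).
print(translate("CUGGCCCAG"))
```

A repeat such as (GCC)5 read in this frame therefore encodes a run of five alanines, which is the kind of repeat-derived amino acid tract discussed in the text.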

DISCUSSION
Our study showed that the freely available SSR-containing EST sequences from EST libraries may be a good source of information for studying functional domains, motif distribution along the genome, annotation, and gene ontology. In this study, SSRs were treated as markers within EST sequences. Our method may help to reveal the functional importance of the publicly available sequences of testis tissue. The study showed that the most frequent SSRs were trinucleotide and dinucleotide repeats. As the length of the repeat unit increases, the number of repeats decreases. This is possibly due to the higher mutation rate of longer sequences, which makes them less stable (Amos & Filipe, 2014).
The structure of SSRs may be changed by mutations, and therefore the number of repeat copies changes. This transformation within microsatellite loci converts a perfect SSR into an imperfect SSR (Sharma et al., 2007). The investigation of EST-SSRs by sequence type (perfect vs. imperfect) showed that perfect SSRs had the higher frequency. Previous studies showed that perfect microsatellites are less stable than imperfect microsatellites, which result from a mutation in their sequences, and that some of these imperfect SSRs have gene regulatory functions (Mudunuri & Nagarajaram, 2007). As an example from this study, the TA motif initially had 16 repeats and, after a relatively short distance, its repetition started again with 6 more repeats. This region is one of the imperfect SSRs; it is related to the SH2 domain and is involved in the gene regulation of intracellular signaling.
Trinucleotide repeats had the highest frequency, followed by dinucleotide repeats (Figure 3). This is a notable point with regard to the inverse relationship between microsatellite length and frequency (Molla et al., 2015). In general, SSRs in class II (sequences longer than 20 nucleotides) tend to be more variable, and this class is more likely to be preserved against slipped-strand abnormalities (Temnykh et al., 2001). Moreover, the SSRs within class II were more polymorphic than the SSRs of class I, as was confirmed by experimental data in humans (Weber, 1990). GT repeats were the most common dimeric repeats, as expected from previous studies on vertebrates (Toth et al., 2000), but in contrast to other reports for cattle (Yan et al., 2008), sheep (Zhang et al., 2010), and chicken (Bakhtiarizadeh et al., 2012a). This is unlike plants, in which the AT/TA repeats have the highest frequency. This difference may be due to the selection of ESTs from only one tissue (testis) in our study, whereas the reports from previous studies are based on global frequencies from all tissues.
Among the trimer repeats, GCC was the most abundant, followed by GGC, which is in agreement with previous studies in cattle and other mammals (Li et al., 2004). GCC codes for the amino acid alanine. Abnormal frequency and distribution of this polyalanine repeat can cause cleidocranial dysplasia, a genetic anomaly arising from an abnormal cellular process (Mundlos et al., 1997).
The distribution pattern of codons within the EST-SSRs of related domains indicated that CUG, GCC, and CAG, which code for leucine, alanine, and glutamine, respectively, had the highest frequencies. A study on mouse testis showed that the CUG codon is involved in the generation of thioredoxin/glutathione reductase (Gerashchenko et al., 2010). The GCC codon with 5 repeats was linked to the Tektin-2 domain, which plays an important role in sperm flagellar structure (Tanaka et al., 2004). CAG has been reported as a codon affecting quantitative and qualitative features of sperm traits (Mostafa et al., 2012).
SSRs in non-coding regions are usually applied as genetic markers to construct linkage maps and in genetic diversity studies (Blair et al., 2003). EST-SSRs can accordingly be used to trace the transcribed regions of the genome and to study functional genes. BLASTX analysis showed that the EST-derived microsatellites fell into a variety of categories matched to known proteins in public databases. As an example, most of the sequences containing the GT motif were matched to the R3H domain with an E value of 1.04e-24. This domain belongs to a group of metazoan proteins related to sperm-associated antigen 7. In general, transcription and translation factors, the cell cycle, and metabolic processes were the most frequent functions for the observed EST-SSRs.
The chromosomal locations of the observed EST-SSRs were determined by in silico mapping onto the Bos taurus genome (Figure 5). As expected, there was a lack of linkage between testis-related genes and the X chromosome, which is supported by previous studies (Moore et al., 2005). Thirteen of the 153 sequences could not be assigned to Bos taurus chromosomes. This was due to the low percentage of identity and query coverage of these EST-SSR sequences. The EST-SSR loci highlighted in red in Figure 5 are those known as domains, and they can be considered as regulating genes for reproductive traits.

CONCLUSION
Many EST-SSRs are related to known domains in testis tissue. Polymorphisms identified via SSRs, especially the trinucleotide class II microsatellite repeats, may cause significant differences in their biological functions. As a result, localization and characterization of such markers can help trace the production of amino acids coded by the identified repeats, as shown in this study. Furthermore, because EST-SSRs are relatively few and highly informative compared with other markers such as SNPs, they can help model fitting in genomic analysis and avoid over-parameterization.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest arising from any financial, personal, or other relationships with other people or organizations related to the material discussed in the manuscript.