What is a promoter region?

There are specific sequences that are generally found within a promoter region, but sometimes people refer to an extended promoter region that includes sequences farther upstream of the gene, which may help enhance or repress transcription of that particular gene in certain cell types.

In general, if you think of the promoter as that piece of DNA that's just upstream of the transcription start site of a gene, that's pretty much what we refer to as promoters.

Elliott Margulies, Ph.D.

In order to train a model that can accurately classify promoter and non-promoter sequences, we need to choose the negative set (non-promoter sequences) carefully. This point is crucial in making a model capable of generalizing well, and therefore able to maintain its precision when evaluated on more challenging datasets. Previous works, such as Qian et al., constructed the negative set by randomly sampling sequences from non-coding regions of the genome. Obviously, this approach is not entirely reasonable, because there is no intersection between the positive and negative sets.

Thus, the model will easily find basic features to separate the two classes. For instance, the TATA motif can be found in all positive sequences at a specific position, normally about 28 bp upstream of the TSS (between −30 and −25 bp in our dataset). Therefore, creating a random negative set that does not contain this motif will produce high performance on this dataset. However, such a model fails on negative sequences that do contain the TATA motif, misclassifying them as promoters.

In brief, the major flaw of this approach is that a deep learning model trained in this way learns to discriminate the positive and negative classes based only on the presence or absence of a few simple features at specific positions, which makes such models impractical. In this work, we aim to solve this issue by establishing an alternative method that derives the negative set from the positive one.

Our method is based on the fact that whenever features are common to both the negative and the positive class, the model tends to ignore them, or at least to reduce its dependency on them, when making its decision. Instead, the model is forced to search for deeper and less obvious features. Deep learning models generally converge more slowly when training on this type of data; however, this method improves the robustness of the model and ensures generalization.

We reconstruct the negative set as follows. Each positive sequence generates one negative sequence. The positive sequence is divided into 20 subsequences.

Then, 12 subsequences are picked at random and substituted with random nucleotides; the remaining 8 subsequences are conserved. This process is illustrated in Figure 1. Applying it to the positive set results in new non-promoter sequences with conserved parts from the promoter sequences (the unchanged subsequences, 8 out of 20). This ratio was found to be optimal for obtaining a robust promoter predictor, as explained in section 3.
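
The following is a minimal sketch of that construction, assuming the substituted subsequences are filled with uniformly random nucleotides; the function name and the nucleotide-level substitution are illustrative assumptions, not code from the paper.

```python
import random

ALPHABET = "ACGT"

def make_negative(promoter_seq, n_parts=20, n_substituted=12, seed=None):
    """Derive one negative (non-promoter) sequence from one positive (promoter) sequence."""
    rng = random.Random(seed)
    part_len = len(promoter_seq) // n_parts
    # Split the promoter into n_parts subsequences (the last one takes any remainder).
    parts = [promoter_seq[i * part_len:(i + 1) * part_len] for i in range(n_parts - 1)]
    parts.append(promoter_seq[(n_parts - 1) * part_len:])
    # Pick 12 of the 20 subsequences at random and replace them with random nucleotides;
    # the remaining 8 subsequences are conserved unchanged.
    for idx in rng.sample(range(n_parts), n_substituted):
        parts[idx] = "".join(rng.choice(ALPHABET) for _ in range(len(parts[idx])))
    return "".join(parts)
```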

The sequence logos of the positive and negative sets for both the human and mouse TATA promoter data are shown in Figures 2 and 3, respectively. Therefore, the training is more challenging, but the resulting model generalizes well.

Figure 1. Illustration of the negative set construction method. Green represents the randomly conserved subsequences, while red represents the randomly chosen and substituted ones.

Figures 2 and 3. Sequence logos of the positive and negative sets for the human and mouse TATA promoter data, respectively; the plots show the conservation of the functional motifs between the two sets.

We propose a deep learning model that combines convolution layers with recurrent layers, as shown in Figure 4. The input is one-hot encoded and represented as a one-dimensional vector with four channels. In order to select the best-performing model, we used a grid search to choose the best hyper-parameters.
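
As a concrete illustration of the input representation just described, here is a minimal sketch of one-hot encoding a DNA sequence into four channels; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch: one-hot encode a DNA sequence into a (length, 4) array,
# one channel per nucleotide (A, C, G, T).
NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in NUC_INDEX:              # ambiguous bases (e.g. N) stay all-zero
            encoded[i, NUC_INDEX[base]] = 1.0
    return encoded

# Example: a real promoter window would be much longer than this toy sequence.
print(one_hot_encode("ACGTN").shape)       # (5, 4)
```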

The tuned hyper-parameters are the number of convolution layers, the kernel size, the number of filters in each layer, the size of the max pooling layer, the dropout probability, and the number of units in the Bi-LSTM layer. The proposed model starts with multiple convolution layers aligned in parallel, which help in learning the important motifs of the input sequences with different window sizes.

We use three convolution layers for non-TATA promoters, with window sizes of 27, 14, and 7, and two convolution layers for TATA promoters, the first with a window size of 27. All convolution layers are followed by a ReLU activation function (Glorot et al.).

Then, the outputs of these layers are concatenated and fed into a bidirectional long short-term memory (BiLSTM; Schuster and Paliwal) layer with 32 units in order to capture the dependencies between the motifs learnt by the convolution layers.

Then we add two fully connected layers for classification; the first is followed by ReLU and dropout. Capturing these long-range dependencies is achieved through the LSTM structure, which is composed of a memory cell and three gates called the input, output, and forget gates.

These gates are responsible for regulating the information in the memory cell. In addition, utilizing the LSTM module increases the network depth while the number of required parameters remains low. Having a deeper network enables extracting more complex features, which is the main objective of our models, as the negative set contains hard samples. The Keras framework is used for constructing and training the proposed models (Chollet, F.).
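
As an illustration only, the following is a minimal Keras sketch of such a parallel-convolution plus BiLSTM architecture. The input length, number of filters, pooling size, dropout probability, and dense-layer width are assumed values chosen for the example, since several of the exact hyper-parameters are not given above; the window sizes and the 32-unit BiLSTM follow the description.

```python
from tensorflow.keras import layers, models

def build_promoter_model(seq_len=300, window_sizes=(27, 14, 7), n_filters=32):
    """Parallel 1D convolutions over one-hot DNA, merged and fed to a BiLSTM."""
    inputs = layers.Input(shape=(seq_len, 4))                # one-hot DNA, 4 channels
    branches = []
    for w in window_sizes:                                   # parallel convolution branches
        x = layers.Conv1D(n_filters, w, padding="same", activation="relu")(inputs)
        x = layers.MaxPooling1D(pool_size=6)(x)
        x = layers.Dropout(0.5)(x)
        branches.append(x)
    merged = layers.concatenate(branches)                    # join the learnt motif maps
    x = layers.Bidirectional(layers.LSTM(32))(merged)        # BiLSTM with 32 units per direction
    x = layers.Dense(128, activation="relu")(x)              # first fully connected layer
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)       # promoter vs. non-promoter
    return models.Model(inputs, outputs)
```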

The Adam optimizer (Kingma and Ba) is used for updating the parameters. The batch size is set to 32, and early stopping is applied based on the validation loss. In this work, we use widely adopted metrics for evaluating the performance of the proposed models.
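
A minimal sketch of this training setup, reusing the build_promoter_model function from the previous sketch and random placeholder data; the epoch count and the optimizer's default learning rate are illustrative, since the exact values are not given above.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Random placeholder data standing in for one-hot promoter/non-promoter sequences.
x_train = np.random.rand(256, 300, 4).astype("float32")
y_train = np.random.randint(0, 2, size=(256, 1))
x_val = np.random.rand(64, 300, 4).astype("float32")
y_val = np.random.randint(0, 2, size=(64, 1))

model = build_promoter_model()
model.compile(optimizer="adam",                 # Adam with Keras default settings (illustrative)
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=32,
          epochs=50,                            # illustrative; the exact value is elided above
          callbacks=[EarlyStopping(monitor="val_loss", patience=5,
                                   restore_best_weights=True)])
```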

These metrics are precision, recall, and the Matthews correlation coefficient (MCC), and they are defined below, where TP (true positive) represents correctly identified promoter sequences, TN (true negative) represents correctly rejected non-promoter sequences, FP (false positive) represents incorrectly identified promoter sequences, and FN (false negative) represents incorrectly rejected promoter sequences.
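
Since the formulas themselves are missing above, the standard definitions of these metrics in terms of TP, TN, FP, and FN are:

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\]
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}.
\]
```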

When analyzing previously published works on promoter sequence identification, we noticed that their performance greatly depends on the way the negative dataset is prepared. They performed very well on the datasets they prepared; however, they have a high false positive rate when evaluated on a more challenging dataset that includes non-promoter sequences sharing common motifs with promoter sequences.

For instance, in the case of the TATA promoter dataset, the randomly generated sequences will not have the TATA motif at the position between −30 and −25 bp, which in turn makes the classification task easier. In other words, their classifiers depended on the presence of the TATA motif to identify a promoter sequence and, as a result, it was easy to achieve high performance on the datasets they prepared.

However, their models failed dramatically when dealing with negative sequences containing the TATA motif (hard examples). The precision dropped as the false positive rate increased; simply put, they classified these sequences as positive promoter sequences. A similar analysis holds for the other promoter motifs. Therefore, the main purpose of our work is not only achieving high performance on a specific dataset but also enhancing the model's ability to generalize well by training on a challenging dataset.

To further illustrate this point, we train and test our model on the human and mouse TATA promoter datasets with different methods of negative set preparation.

The first experiment is performed using negative sequences sampled randomly from non-coding regions of the genome. The resulting high performance is expected, but the question is whether this model can maintain it when evaluated on a dataset that contains hard examples.

The answer, based on analyzing the prior models, is no. The second experiment is performed using our proposed method for preparing the dataset as explained in section 2.

This ensures that our model learns more complex features rather than only the presence or absence of the TATA-box.

Figure 5.

Over the past years, plenty of promoter region prediction tools have been proposed (Hutchinson; Scherf et al.). However, some of these tools are not publicly available for testing, and some of them require more information besides the raw genomic sequences.

Whereas DNA is generally depicted as a straight line in two dimensions, it is actually a three-dimensional object. Therefore, a nucleotide sequence thousands of nucleotides away can fold over and interact with a specific promoter.

Enhancers: An enhancer is a DNA sequence that promotes transcription. Each enhancer is made up of short DNA sequences called distal control elements.

Activators bound to the distal control elements interact with mediator proteins and transcription factors. Like prokaryotic cells, eukaryotic cells also have mechanisms to prevent transcription. Transcriptional repressors can bind to promoter or enhancer regions and block transcription.

Like the transcriptional activators, repressors respond to external stimuli to prevent the binding of activating transcription factors. A corepressor is a protein that decreases gene expression by binding to a transcription factor that contains a DNA-binding domain. The corepressor is unable to bind DNA by itself. The corepressor can repress transcriptional initiation by recruiting histone deacetylase, which catalyzes the removal of acetyl groups from lysine residues. This increases the positive charge on histones, which strengthens the interaction between the histones and DNA, making the DNA less accessible to the process of transcription.

Both the packaging of DNA around histone proteins and chemical modifications to the DNA or proteins can alter gene expression. Eukaryotic gene regulation also occurs at the epigenetic level, through various epigenetic changes that can be made to DNA. The human genome encodes over 20,000 genes; each of the 23 pairs of human chromosomes encodes thousands of genes.

The DNA in the nucleus is precisely wound, folded, and compacted into chromosomes so that it will fit into the nucleus. It is also organized so that specific segments can be accessed as needed by a specific cell type. The first level of organization, or packing, is the winding of DNA strands around histone proteins.

Histones package and order DNA into structural units called nucleosome complexes, which can control the access of proteins to the DNA regions.

Under the electron microscope, this winding of DNA around histone proteins to form nucleosomes looks like small beads on a string. These beads (histone proteins) can move along the string (DNA) and change the structure of the molecule.

These nucleosomes control the access of proteins to the underlying DNA. When viewed through an electron microscope, the nucleosomes look like beads on a string. Nucleosomes can move to open the chromosome structure to expose a segment of DNA, but they do so in a very controlled manner. Nucleosomes can change position to allow transcription of genes: nucleosomes can slide along DNA. When nucleosomes are spaced closely together (top), transcription factors cannot bind and gene expression is turned off.

When the nucleosomes are spaced far apart (bottom), the DNA is exposed and transcription factors can bind, allowing gene expression to occur. Modifications to the histones and DNA affect nucleosome spacing. How the histone proteins move depends on signals found both on the histone proteins and on the DNA.

These signals are tags, or modifications, added to histone proteins and DNA that tell the histones whether a chromosomal region should be open or closed. These tags are not permanent, but may be added or removed as needed. They are chemical modifications (phosphate, methyl, or acetyl groups) that are attached to specific amino acids in the protein or to the nucleotides of the DNA.

The tags do not alter the DNA base sequence, but they do alter how tightly wound the DNA is around the histone proteins. DNA is a negatively-charged molecule; therefore, changes in the charge of the histone will change how tightly wound the DNA molecule will be.

When unmodified, the histone proteins have a large positive charge; by adding chemical modifications, such as acetyl groups, the charge becomes less positive. Modifications affect nucleosome spacing and gene expression.

The DNA molecule itself can also be modified. This occurs within very specific regions called CpG islands. These are stretches with a high frequency of cytosine and guanine dinucleotide DNA pairs (CG) found in the promoter regions of genes. When this configuration exists, the cytosine member of the pair can be methylated (a methyl group is added). This modification changes how the DNA interacts with proteins, including the histone proteins that control access to the region. Highly methylated (hypermethylated) DNA regions with deacetylated histones are tightly coiled and transcriptionally inactive.

These changes to DNA can be inherited from parent to offspring, such that while the DNA sequence is not altered, the pattern of gene expression is passed to the next generation. This type of gene regulation is called epigenetic regulation. Epigenetic changes do not permanently alter the DNA sequence; instead, they are temporary (although they often persist through multiple rounds of cell division) and alter the chromosomal structure (open or closed) as needed.

A gene can be turned on or off depending on the location of, and modifications to, the histone proteins and DNA. If a gene is to be transcribed, the histone proteins and DNA surrounding the chromosomal region encoding that gene are modified.


