Abstract

Functional region identification is of fundamental importance for protein sequence analysis. Such knowledge provides better scientific understanding and could assist drug discovery. To date, domain annotation is one approach, but it relies on existing databases. For de novo discovery, motif discovery locates and aligns locally homologous sub-sequences to obtain a position-weight matrix (PWM). A PWM is a fixed-length representation model, whereas the size of protein functional regions varies, so obtaining a PWM of optimal width requires a computationally expensive exhaustive search. This paper presents a new method, pattern-directed aligned pattern clustering (PD-APCn), to discover and align patterns in conserved protein functional regions. It adopts an aligned pattern cluster (APC) whose patterns have variable length and strong support to direct incremental APC expansion, allowing substitution and frame-shift mutations until a robust termination condition is reached. The concept of a breakpoint gap is introduced to identify sites of mutation, such as substitutions and frame shifts. Experiments on synthetic data sets of different sizes and noise levels showed that PD-APCn outperforms MEME, with much higher recall and F-measure, and runs 665 times faster than MEME. When applied to the Cytochrome C and Ubiquitin families, it found all key binding sites within the APCs.
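
To illustrate why a PWM is a fixed-length model, the following minimal sketch builds a simplified PWM (column-wise residue probabilities) from already aligned sub-sequences. It is not the PD-APCn method or MEME's implementation. The function name `build_pwm`, the pseudocount, and the toy sub-sequences are assumptions made for illustration only, and practical PWMs are often stored as log-odds scores against a background distribution rather than as raw probabilities.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: probability} dict per alignment column.

    Hypothetical helper: every sub-sequence must have the same width,
    which is exactly the fixed-length constraint of a PWM.
    """
    if not aligned:
        raise ValueError("at least one sub-sequence is required")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("a PWM needs sub-sequences of one fixed width")

    # Denominator is the same for every column: observed counts plus
    # one pseudocount per residue in the alphabet.
    total = len(aligned) + pseudocount * len(alphabet)
    pwm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        pwm.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return pwm


# Toy aligned sub-sequences, made up for illustration.
toy_alignment = ["CAKCH", "CSQCH", "CAACH"]
pwm = build_pwm(toy_alignment)
for position, column in enumerate(pwm):
    best = max(column, key=column.get)
    print(position, best, round(column[best], 3))
```

Because the width is fixed when the matrix is built, covering a longer or shorter functional region means rebuilding the matrix at another width. Searching over candidate widths in this way is the cost that the abstract attributes to PWM-based motif discovery.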
