Motifs, known as short and recurring patterns in DNA sequences, are presumed to have biological significance.
They can indicate binding sites for transcription factors, which regulate gene expression. Finding motifs from biological
sequences is a major task for unraveling the mechanisms of gene expression. Many algorithms have been designed for
various problem models, among which, the problem model with substitution, deletion and insertion in motifs is especially
a challenge. In 2010, Le et al. designed an algorithm named HIGEDA, which combines EM (expectation maximization)
algorithm with GA (genetic algorithm), to find gapped motifs (motifs with substitution and deletion) in biological
sequences. In this work, we improve HIGEDA by developing an algorithm named GAEM to solve the planted edited
motif finding problem, i.e., to find motifs with substitution, deletion, and also insertion in DNA sequences. We test
GAEM on both simulated DNA datasets and DNA transcription factor binding site datasets. More precisely, performance
evaluation of GAEM is carried on simulated datasets for the planted edited motif finding problem, where the simulation
results show that GAEM can recover motifs with higher and more stable success rate as compared with HIGEDA. For
DNA transcription factor binding site datasets, one eukaryotic dataset, ere, and four Escherichia coli datasets, crp, arcA,
argR and purR, are used to compare GAEM with HIGEDA, GLAM2 and MEME. The simulation results show that
GAEM performs better on finding motifs in realistic DNA datasets as compared with the other three algorithms.