[Algorithm] BM Algorithm for String Matching, Implemented in C Language

Today, we will continue the topic from yesterday, the BM algorithm for string matching. BM can be considered a more efficient string matching algorithm compared to the KMP algorithm. BM is the masterpiece of the master Boyer-Moore, hence the name BM algorithm. Compared to the KMP algorithm, BM algorithm improves efficiency significantly. In terms of space, BM algorithm requires an auxiliary space that is the same size as the matching character set to store different matching characters. This is a bit wasteful compared to KMP, but it is also a feature of BM. It can be used with different character sets. If there are two character sets, then an auxiliary space of the same size as the character set can be used. This is very good for handling complex characters. Currently, most high-level languages such as C# use BM and its improved algorithm (AC-BM algorithm). Compared to KMP, which matches the appearance of two Chinese characters in half-width, I still prefer BM. Although it wastes space, it achieves a consumption close to sublinear, reducing the matching time by more than n. This point is also objective.

BM algorithm has many derivative algorithms, and AC-BM algorithm is one of them. It is optimized using mathematical methods and improves the efficiency by a constant level in the best case, enhancing the utilization of indexes. I will write about it next time. Algorithm principle: Scanning from the end of the string, it utilizes the replacement principle of matching suffixes and invalid characters. Overall, the efficiency is greatly improved. The algorithm is as follows. If you have any questions about the specific algorithm comments, please leave a message or contact me, and I will try to answer when I have time.

Debugging is welcome. I haven't had time to debug it under the TC environment and GCC, but there should be no problem after some modifications. BM string matching algorithm:

/*BM string matching algorithm*/
/*code by CG lidaren.com
* ACM yctc
*2008 12 20
*/
#include "stdio.h"
#include "string.h"
#include "stdlib.h"

#define LEN 256
/*LEN uses one character set. For ASCII encoding, one byte is used to represent a character. For UNICODE, it needs to be considered separately. This is where the advantage of BM algorithm lies.*/

/*BM() BM algorithm basic function
*Input:
* char *s matching string
* char *p pattern string
* int index matching start index
* int post[] auxiliary array
*Return:
* int next matching start index, -1 if matching fails
*/
int BM(char *s, char *p, int index, int post[]) {
int len = strlen(s);
int i,j, next;
i = strlen(p)-1;/*Subtract 1 from the string length*/
j = index+strlen(p)-1;/*When calling BM() for the first time, index = 0, because the for loop below compares from the end of the pattern string, so the initial comparison position of the matching string should start from the beginning, counting the length of the pattern string.*/
for(; i>=0; i--, j--) {
if(s[j] != p[i]){/*Matching of the first character*/
break;
}
}/*for*/

if(i<0) /*Matching completed?*/
return 0; /*Matching successful*/

else if(post[s[j]]>0)
/*When a mismatch occurs, check if the character at the current position of the matching string appears in the pattern string*/
next = index + i - post[s[j]];

/*index is the current offset of the matching string, i is the remaining number of comparison substrings in the pattern string,
* post[s[j]] is the position of the first mismatched character in the matching string.
* Next time, the comparison will start from the position in the matching string where s[j] appears
*/
else next = index + 1;

if(next > LEN-strlen(p))
return -1; /*Matching failed, unable to proceed to the next match*/
else
return next; /*Matching failed, need to proceed to the next match*/
}/*BM*/

/*Test, both the matching string and the pattern string use lowercase characters*/
int main()
{
int post[LEN]={0}; /*Auxiliary array = character set size*/

char \*src="aaaabbbaababababbabb";/\*Test string\*/
char \*patten="aabbabb";

int i, next, index=-2, pos=0;/\*Initialize index flag\*/

for(i=0; i
index = BM(src, patten, 0, post);/\*First match, starting from position 0, obtain NEXT\*/

while(!(index == -1 || index == 0)) /\*Loop until matching succeeds\*/
{
  next = index;
  index = BM(src, patten, next, post);/\*Next BM match\*/
}/\*while\*/

if(index == -1){ /\*Failed\*/
   printf("Match failed.\n");
}

if(index == 0){ /\*OK\*/
   printf("The index is: %d.\n", next);
}
return 0;

}/*main*/

I searched for some information about the BM algorithm online. Please refer to http://blog.chinaunix.net/u/11828/showart_242074.html for detailed explanations.