What is AC Automata? How to achieve it?

What is AC Automata?
It is an efficient multi-pattern matching algorithm based on Trie tree and KMP mismatch pointer. AC automata can be built at once and then match multiple sensitive words simultaneously while traversing the text.
The typical application of AC automata algorithm is sensitive word matching. This algorithm comes in handy when users of major social media platforms such as Weibo, Zhihu, Douyin, etc. post comments. The most typical one is Honor of Kings. When players post some violence in the chat box, they will be marked as '*' display.
If this algorithm is not used, how can we achieve matching of sensitive words?
I thought of using hash table search to achieve sensitive matching.
Analyze the advantages and disadvantages of AC automata and hash table search

Hash table

Hash table matching is by storing sensitive words in the hash table and then checking one by one whether each substring in the text exists in the hash table. Each time a substring is taken out from the text, check whether the substring exists in the hash table. The search time complexity of the hash table is 𝑂(1).
AC automatic machine

Use a Trie tree to store multiple sensitive words, ensuring that the shared prefix of each sensitive word is not stored repeatedly.
The mismatch pointer is used to jump to the next possible matching position when the match fails, avoiding rescanning the text.
The time complexity of the AC automaton during the search phase is O(m), where m is the text length.

Time complexity
Hash table
Preprocessing time: The time complexity of inserting sensitive words into the hash table is O(k), where k is the total length of all sensitive words.
Match time: For each possible substring, the hash table needs to be validated one by one. Assuming that the text length is m and the longest sensitive word length is L, the time complexity of O(m⋅L) is required to perform substring inspections one by one.
AC automatic machine
Preprocessing time: The time complexity of building Trie trees and mismatch pointers is O(k), the same as the hash table.
Matching time: The AC automaton only needs to traverse the text once, and each time it transfers state along the Trie tree. The matching time complexity is O(m), where m is the text length.
Space complexity
Hash table
Space consumption: The hash table needs to store independent key-value pairs for each sensitive word. The size of the hash table is proportional to the number of sensitive words, so the spatial complexity is O(n), where n is the number of sensitive words.
Sensitive words are stored independently: Hash tables cannot use prefix sharing between sensitive words, so each sensitive word needs to be stored independently, which can easily lead to repeated storage of longer prefixes.
AC automatic machine
Space consumption: The Trie tree of AC automaton utilizes prefix sharing between sensitive words, reducing space consumption. The spatial complexity of constructing a Trie tree is O(k), where k is the total number of characters for all sensitive words.
Missing pointer: The mismatch pointer requires extra space, but since the mismatch pointer only points to other nodes in the Trie tree, the extra consumption is not high.
Match performance
Hash table
Character-by-character matching: Hash table matching is less efficient because it requires searching by substring. For each starting position, it must start with 1 character and gradually expand to the longest sensitive word length until the sensitive word is found or the entire text scan is completed. Therefore, its time complexity will multiply with the text length.
Limitations: Hash tables cannot effectively handle sensitive words with the same prefix. For example, for the sensitive words "ab" and "abc", if you read "abc" from the text, the hash table will need to check the three substrings "a", "ab", and "abc" respectively.
AC automatic machine
One-time Matching: The design of AC automaton can complete multi-pattern matching at one time. During the process of traversing the text, it can detect the occurrence of all sensitive words at the same time, and each character will be traversed only once, avoiding repeated substring checks.
High efficiency: The time complexity of AC automaton is O(m), and the matching performance is significantly better than the hash table when processing large texts and large number of sensitive words.
Code implementation:
Tree construction (prefix tree)

Click to view the code

// Build Goto table (Trie tree)
 void AcString::BuildGotoTable() {
     for (size_t i = 0; i < (); ++i) {
         const std::string& word = patterns[i].pattern;
         int current = 0; // Start from the root node
         for (char ch : word) {
             if (nodes[current].(ch) == nodes[current].()) {
                 AddState(current, ch);
                 nodes[current].sons[ch] = () - 1;
             }
             current = nodes[current].sons[ch];
         }
         // Add output to the last node
         nodes[current].output.push_back(i);
     }
 }

one, The traversal of the fail pointer is hierarchical traversal two, The fail pointer of the root node points to itself three, If the fail pointer of the current node parent node points to the same child node as the current node, the fail pointer of the current node points to the child node, otherwise it points to the root node.

Construction of pointers

Click to view the code

void AcString::BuildFailTable() {
     std::queue<int> q;
     // Initialize the fail pointer of the child node of the root node to the root node
     for (auto& pair : nodes[0].sons) {
         int child = ;
         nodes[child].fail = 0;
         (child);
     }

     // BFS traversal the Trie tree and build a fail pointer
     while (!()) {
         int current = ();
         ();

         for (auto& pair : nodes[current].sons) {
             char ch = ;
             int child = ;

             // Find out whether the child node of the node pointed to by the fail pointer of the current node has characters ch
             int fail = nodes[current].fail;
             while (fail != -1 && nodes[fail].(ch) == nodes[fail].()) {
                 fail = nodes[fail].fail;
             }

             if (fail == -1) {
                 nodes[child].fail = 0; // Back to the root node
             } else {
                 nodes[child].fail = nodes[fail].sons[ch];
                 // Add the output of the fail node to the output of the current node
                 for (int index : nodes[nodes[fail].sons[ch]].output) {
                     nodes[child].output.push_back(index);
 //If fail == -1, it means that the current node has not found a suitable failed jump path, so the fail pointer of the child child is pointed to the root node nodes[0].
 Otherwise, point the child child's fail pointer to the child node of the fail node (i.e. nodes[fail].sons[ch], indicating that the character ch can be continued to be matched through the fail node).
 Next, append the output mode (output) of the fail node to the output list of the child node child.  This step is because the fail node has matched some patterns, so if the child node also reaches this state, it should inherit these matching patterns.
                 }
             }

             (child);
         }
     }
 }

![](/blog/3367816/202410/)