Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Google   


Pattern Matching and Quick String Search

Old News

See Also Recommended Books Recommended Links Lecture Notes and E-books Donald Knuth TAOCP Volume3 Animations
Knuth-Morris-Pratt algorithm Boyer-Moore string search algorithm Sunday string search algorithm Hume and Sunday -- Tuned Boyer-Moore
      and Least Cost
 Harrison Karp-Rabin Regular expressions Etc

Here is bibliography that can help to start in this area (See also STRING SEARCHING ALGORITHMS by Graham A Stephen)

Exact Text Searching

Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 255-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.

Abrahamson K., Generalized string matching, SIAM Journal on Computing, 16(6), 1039-1051, 1987.

Amir A., Landau G.M., and Vishkin U., Efficient pattern matching with scaling, Journal of Algorithms, 13(1), 2-32, 1992.

Apostolico A., and Giancarlo R., The Boyer-Moore-Galil string searching strategies revisited, SIAM Journal on Computing, 15(1), 98-105, 1986.

Baeza-Yates R., and Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.

Baeza-Yates R., and Regnier M., Average running time of the Boyer-Moore-Horspool algorithm, Theoretical Computer Science, 92(1), 19-31, 1992.

Baeza-Yates R., Choffrut C., and Gonnet G., On Boyer Moore Automata, Algorithmica, 12(4-5), 268-292, 1994.

Baeza-Yates R., and Fuentes L., A framework to animate string algorithms, Information Processing Letters, 59(5), 241-244, 1996.

Baeza-Yates R., Improved string matching, Software-Practice and Experience, 19(3), 257-271, 1989.

Baeza-Yates R.A., Algorithms for String Searching: A Survey, ACM SIGIR Forum, 23(3-4), 34-58, 1989.

Baeza-Yates R., Text Retrieval: Theory and Practice, in Proc. of the 12th IFIP World Computer Congress, 465-476, (Madrid, Spain), North-Holland, 1992.

Baeza-Yates R., A unified view to string matching algorithms, Thoery and Pratice of Informatics, 23rd Seminar, 1-15, 1996.

Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., and Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40(1), 31-55, 1985.

Boyer R.S., and Moore J.S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977.

Breslauer D., Colussi L., and Toniolo L., Tight comparison bounds for the string prefix matching problem, Information Processing Letters, 47(1), 51-57, 1993.

Breslauer D., Saving comparisons in the Crochemore-Perrin string matching, Theoretical Computer Science, 158(1-2), 177-192, 1996.

Bruyere V., Baeza-Yates R., Dergrange O., and Scheihing R., About the size of Boyer-Moore automata, in Proc. of the 3rd South American Workshop on String Processing, 31-46, Carleton, University Press, 1996.

Charras C., and Lecroq T., Exact string matching algorithms, Technical Report, 1997.

Cole R., Hariharan R., Zwick U., and Paterson M.S., Tighter lower bounds on the exact complexity of string matching, SIAM Journal on Computing, 24(1), 30-45, 1995.

Cole R., Tight bounds on the complexity of the Boyer-Moore pattern matching, SIAM Journal on Computing, 23(5), 1075-1091, 1994.

Colussi L., Correctness and Efficiency of pattern matching algorithm, Information and Computation, 95(2), 225-251, 1991.

Colussi L., Fatest pattern matching in strings, Journal of Algorithms, 16( 2), 163-189, 1994.

Crochemore M., and Rytter W., Text Algorithms, Oxford University Press, 1994.

Crochemore M., Czymaj A., Gasieniec L., Jarominek S., Lecroq T., Plandowski W., and Rytter W., Speeding Up Two String Matching Algorithms, Algorithmica, 12(4-5), 247-267, 1994.

Crochemore M., and Lecroq T., Tight bounds on the complexity of the Apostolico-Giancarlo algorithm, Information Processing Letters, 63(4), 195-203, 1997.

Crochemore M., and Lecroq T., Pattern matching and text compression algorithms, in The Computer Science and Engineering Handbook, Tucker A.B., ed., CRC Press, Boca-Raton, Chapter 8, 162-202, 1997.

Crochemore M., and Hancart C., Automata for Matching Patterns, Handbook of Formal Languages, Rosenberg C., Salamaa A. eds, 2, 399-462, Springer-Verlag, 1997.

Crochemore M., and Hancart C., Pattern matching in strings, in Algorithms and theory of computation Handbook, Mikhail J., Atallah, ed. CRC Press, Boca Raton, 1998.

Crochemore M., String matching on ordered alphabets, Theoretical Computer Science, 92(1), 33-47, 1992.

Crochemore M., Off-line serial exact string searching, in Pattern Matching Algorithms, Apostolico A., Z. Galil, ed., Oxford University Press, New York, Chapter 1, 1-53, 1997.

Davies G., and Bowsher S., Algorithms for pattern matching, Software-Practice and Experience, 16(6), 575-601, 1996.

Ferragina P., and Grossi R., Optimal On-line search and sublinear time update in strings, SIAM Journal on Computing, 27(3), 713-736, 1998.

Galil Z., and Giancarlo R., On the exact complexity of string matching: Lower Bounds, SIAM Journal on Computing, 20(6), 1008-1020, 1991.

Galil Z., and Giancarlo R., On the exact complexity of string matching: Upper Bound, SIAM Journal on Computing, 21(3), 407-437, 1992.

Galil Z., On improving the worst case running time of the Boyer-Moore algorithm, Communications of the ACM, 22(9), 505-508, 1979.

Gasieniec L., Plandowski W., and Rytter E., The zooming method: a resursive approach to time-space algorithm, Theoretical Computer Science, 147(1-20), 19-30, 1995.

Gasieniec L., Plandowski W., and Rytter W., Sequential sampling: a new approach to constant space pattern matching, in Proc. of the 6th Annual Symposium on Combinatorial Pattern Matching, LNCS 937, 78-89, Springer-Verlag, Berlin, 1995.

Gonnet G.H., and Baeza-Yates R.A., An analysis of the Karp-Rabin string matching algorithm, Information Processing Letters, 34(5), 271-274, 1990.

Gonnet G.H., and Baeza-Yates R., Handbook of Algorithms and Data Structures in Pascal and C, 2nd edition, Addison-Wesley, Workingham, 251-288, 1991.

Gusfield D., Algorithms on strings, trees and sequences, Gambridge University Press, 1997.

Hancert C., On Simon?s string searching algorithm, Information Processing Letters, 47(2), 95-99, 1993.

Horspool R.N., Practical fast searching in strings, Software-Practice and Experience, 10(6), 501-506, 1980.

Hume A., and Sunday D., Fast string searching, Software-Practice and Experience, 21(11), 1221-1248, 1991.

Knuth D.E., Morris J.H., and Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323-350, 1977.

Lecroq T., A variation on the Boyer-Moore algorithm, Theoretical Computer Science, 92(1), 119-144, 1992.

Lecroq T., Experimental results on string matching algorithms, Software-Practice and Experience, 25(7), 727-765, 1995.

Lecroq T., Experiments on string matching in memory structures, Software-Practice and Experience, 28(5), 561-568, 1998.

Liu Z., Du X., and Ishii N., An improved adaptive string searching algorithm, Software-Practice and Experience, 28(2), 191-198, 1998.

Manolopoulos Y., and Faloutsos C., Experimenting with pattern matching algorithms, Information Sciences, 90(1-4), 75-89, 1996.

Perleberg C., Single character searching methods and the shift-or pattern matching algorithm, Information Processing Letters, 50(5), 269-275, 1994.

Raita T., Tunning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 22(10), 879-884, 1992.

Reingold E., Urban K.J., and Gries D., K-M-P string matching revisited, Information Processing Letters, 64(5), 217-223, 1997.

Sedgewick R., Algorithms, 2nd edition, Addison Wesley, Reading Mass, 277-303, 1988.

Smit G. De V., A Comparison of Three String Matching Algorithms, Software-Practice and Experience, 12(1), 57-66, 1982.

Smith P., Experiments with a very fast substring search algorithm, Software-Practice and Experience, 21(10), 1065-1074, 1991.

Smith P., On tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 24(4), 425-436, 1994.

Stephen A.G., String Search, Techinal Report, School of Electronic Engineering Science, 1992.

Sunday D., A very fast substring search algorithm, Communications of the ACM, 33(8), 132-142, 1990.

Tarhio J., and Peltora H., String matching in the DNA alphabet, Software-Practice and Experience, 27(7), 851-861, 1997.

Watson B., The performance of single and multiple keyword pattern matching algorithms, in Proc. of the 3rd South American Workshop on String Processing, 280-294, Carleton, University Press, 1996.

Watson B., SPARE Parts: A C++ toolkit for string Pattern Recognition, in Proc. of the Second Prague stringologic Club Workshop, 47-60, 1997.

Wu S., and Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.

Approximate Text Searching

Aho A., Corasick M., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333-340, 1975.

Arlazarov V.L., Dinic E.A., Kronrod M.A., Faradzev I.A., On economic construction of the transitive closure of a directed graph, Dokl. Akad. Nauk SSSR, 194, 487-488, 1970 (in Russian). English translation in Soviet Math. Dokl., 11,1209-1210, 1975.

Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 2 55-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.

Aoe J., Computer Algorithms - String Pattern Matching Strategies, IEEE Computer Society Press, Los Alamitos, California, 1994.

Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 31-55, 1985.

Baeza-Yates R.A., Text Retrieval: Theory and practice, in Proc. 12th IFIP World Computer Congress, (Madrid, Spain), 1992.

Baeza-Yates R.A., A unified view of string matching algorithms, In SOFSEM 96: Theory and Practice of Informatics, 1-15, Springer-Verlag, 1996.

Baeza-Yates R.A., Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.

Baeza Yates R.A., Gonnet G., Fast string matching with mismatches, Information and Computation, 108(2), 187-199, 1994.

Baeza-Yates R.A., Navarro G., A fast heuristic for approximate string matching, in Proc WSP 96, 47-63, Carleton University Press, 1996.

Baeza-Yates R.A., Navarro G., A faster algorithm for approximate string matching, In Proc Combinatorial Pattern Matching 96, 1-23, Springer-Verlag, 1996.

Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate pattern matching, in Proc Combinatorial Pattern Matching, CPM 92, 185-192, 1992.

Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate string matching, Information Processing Letters, 59, 21-27, 1996.

Chang W.I, Lampe J., Theoretical and Emprirical Comparisons of Approximate String Matching Algorithms, In Proc. 3rd Combinatorial Pattern Matching, 175-184, University of Arizona, USA, 1992.

Chang W., Lawler E., Sublinear approximate string matching and biological applications, Algorithmica, 12(4/5), 327-344, 1994.

Chang W, Marr T., Approximate string matching and local similarity, In Proc Combinatorial Pattern Matching 94, 259-273, Springer-Verlag, 1994.

Crochemore M., String matching with constraints, in Proc. MFCS’ 88, 324, 44-58, 1988.

Commentz-Walter B., A string matching algorithm fast on the average, in Proc. ICALP 79, 118-132, Springer-Verlag, 1979.

Dermouche A., A fast algorithm for string matching with mismatches, Information Processing Letters, 55, 105-110, 1995.

Galil Z., Giancarlo R., Improved string matching with k mismatches, Sigact News, 17, 52-54, 1986.

Galil Z., Giancarlo R., Data Structures and Algorithms for Approximate String Matching, Journal of Complexit, 4, 33-72, 1988.

Giegerich R., Kurtz S., Hischke F., Ohlebusch E., A general technique to improve filter algorithms for approximate string matching, In Proc WSP 97, 38-52, Carleton University Press, 1997.

Grossi R., Luccio F., Simple and efficient string matching with k mismatches, Information Processing Letters, 33, 113-120, 1989.

Galil Z., Park K., An improved Algorithm for Approximate String Matching, SIAM Journal of Computing, 19(6), 989-999, 1990.

Hall P., Dowling G., Approximate string matching, ACM Computing Surveys, 12(4), 381-402, 1980.

Harel D., Tarhjan R.E., Fast algorithms for finding nearest common ancestors, SIAM Journal on Computing, 13(2), 338-355, 1984.

Jokinen P., Tarhio J., Ukkonen E., A Comparison of Approximate String Matching Algorithms, Software-Practice and Experience,  26(12), 1439-1458, 1996.

Kurtz S., Approximate string searching under weighted edit distance, in Proc. WSP 96, 156-170, Carleton University Press, 1996.

Landau G.M., Vishkin U., Efficient String Matching with k Mismatches, Theoretical Computer Science, 43, 239-249, 1986.

Landau G.M., Vishkin U., Fast String Matching with k Differences, Journal of Computer and System Sciences, 37, 63-78, 1988.

Landau G.M., Vishkin U., Fast Parallel and Serial Approximate String Matching, Journal of Algorithms, 10, 157-169, 1989.

McCreight E., A space-economical suffix tree construction algorithm, Journal of the ACM, 23(2), 262-272, 1976.

Michailidis P., Margaritis K., String Matching Algorithms, Technical Report, Dept. of Applied Informatics, University of Macedonia, 1999 (in Greek).

Masek W.J., Paterson M.S., A Faster Algorithm Computing String Edit Distances, Journal of Computer and System Sciences, 20, 18-31, 1980.

Myers G., A fast bit-vector algorithm for approximate pattern matching based on dynamic programming. In Proc Combinatorial Pattern Matching 98, 1-13, Springer-Verlag, 1998.

Navarro G., Mutiple approximate string matching by counting, in Proc WSP 97, 125-139, Carleton University Press, 1997.

Navarro G., A partial deterministic automaton for approximate string matching, in Proc WSP 97, 112-124, Carleton University Press, 1997.

Navarro G., Approximate Text Searching, Ph.D. Thesis, University of Chile, Dept. of Computer Science, 1998.

Navarro G., Raffinot M., A bit-parallel approach to suffix automata: Fast extended string matching, in Proc Combinatorial Patern Matching 98, 14-33, Springer-Verlag, 1998.

Sellers P.H., The Theory and Computation of Evolutionary Distance: Pattern Recognition, Journal of Algorithms, 1, 359-373, 1980.

Sutinen E., Tarhio J., On using q-gram locations in approximate string matching, In Proc ESA 95, 327-340, Springer-Verlag, 1995.

Stephen G. A., String Searching Algorithms, World Scientific Press, 1994.

Takaoka T., Approximate pattern matching with samples, In Proc. ISAAC 94, 234-242, Springer-Verlag, 1994.

Tarhio J., Ukkonen E., Approximate Boyer-Moore String Matching, SIAM Journal on Computing, 22(2), 243-260, 1993.

Ukkonen E., Algorithms for Approximate String Matching, Information and Control, 64, 100-118, 1985.

Ukkonen E., Finding Approximate Patterns in Strings, Journal of Algorithms, 6, 132-137, 1985.

Ukkonen E., Approximate string matching with q-grams and maximal matches, Theoretical Computer Science, 1, 191-211, 1992.

Ukkonen E., Wood D., Approximate String Matching with Suffix Automata, Algorithmica, 10, 353-364, 1993.

Weiner P., Linear pattern matching algorithm, in Proc. 14th IEEE Symposium on Swithching and Automata Theory, 1-11, 1973.

Wagner R.A., Fischer M.J., The String to String Correction Problem, Journal of the Association for Computing Machinery,  21(1), 168-173, 1974.

Wu S., Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.

Wu S., Manber U., Myers G., A Subquadratic algorithm for approximate limited expression matching, Algorithmica, 15, 50-67, 1996.

Wright A., Approximate string matching using within-word parallelism, Software-Practice and Experience, 24( 4), 337-362, 1994.

 

Notes:
  • This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • The site contain some broken links as it develops like a living tree... Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate if you can mail us a correct link.
Google Search
Open directory

Research Index

Old News ;-)

[Nov 20, 2004] AGREP, an approximate GREP

[Nov 20, 2004] Implementation of the Boyer-Moore-Horspool algorithm

Fast String Searching With Suffix Trees by Mark Nelson Dr. Dobb's Journal August, 1996

I think that I shall never see
A poem lovely as a tree.
Poems are made by fools like me,
But only God can make a tree.

Joyce Kilmer

A tree's a tree. How many more do you need to look at?

Ronald Reagan

Matching string sequences is a problem that computer programmers face on a regular basis. Some programming tasks, such as data compression or DNA sequencing, can benefit enormously from improvements in string matching algorithms. This article discusses a relatively unknown data structure, the suffix tree, and shows how its characteristics can be used to attack difficult string matching problems.

The problem

Imagine that you've just been hired as a programmer working on a DNA sequencing project. Researchers are busy slicing and dicing viral genetic material, producing fragmented sequences of nucleotides. They send these sequences to your server, which is then expected to locate the sequences in a database of genomes. The genome for a given virus can have hundreds of thousands of nucleotide bases, and you have hundreds of viruses in your database. You are expected to implement this as a client/server project that gives real-time feedback to the impatient Ph.D.s. What's the best way to go about it?

It is obvious at this point that a brute force string search is going to be terribly inefficient. This type of search would require you to perform a string comparison at every single nucleotide in every genome in your database. Testing a long fragment that has a high hit rate of partial matches would make your client/server system look like an antique batch processing machine. Your challenge is to come up with an efficient string matching solution.

The intuitive solution

Since the database that you are testing against is invariant, preprocessing it to simplify the search seems like a good idea. One preprocessing approach is to build a search trie. For searching through input text, a straightforward approach to a search trie yields a thing called a suffix trie. (The suffix trie is just one step away from my final destination, the suffix tree.) A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. The word 'suffix' is used in this case to refer to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome.)
 

Karp-Rabin string searching

#define B 131

char *search( pat, text )
char *pat, *text;

{ int hpat, htext, Bm, j, m;

if( pat[0]==EOS ) return( text );
Bm = 1;
hpat = htext = 0;

for( m=0; text[m] != EOS && pat[m] != EOS; m++ ) {
Bm *= B;
hpat = hpat*B + pat[m];
htext = htext*B + text[m];
}

if( text[m]==EOS && pat[m]!=EOS ) return( NULL );

for( j=m; TRUE; j++ ) {
if( hpat==htext && strncmp(text+j-m,pat,m)==0 )
return( text+j-m );
if( text[j]==EOS ) return( NULL );
htext = htext*B - text[j-m]*Bm + text[j];
}
 

String Searching by Block Sorting

The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:

    for i = 0 to len(haystack)-len(needle):
        matched = YES
        for j = 0 to len(needle)-1:
            if haystack[i+j] != needle[j]:
                matched = NO
                break
            end if
        end for
        if matched = YES:
            return i
    end for
    return "not found"

This has time complexity O(H*N) (where H is the length of the haystack, and N is the length of the needle). Of course, you don't usually need to test all the characters of the needle for each value of i, but it can happen in the worst case (consider searching for the needle "aaaaaaaaaaab" in a haystack made of one million repetitions of "a").

A well-known optimisation is to start by converting the needle into a finite state machine. This reduces the cost of the search to O(H), independent of N, at the expense of a setup phase that does depend on N. For a single fixed needle and a huge amount of haystack text, this is a worthwhile tradeoff.

On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?

Algorithm Description

Construct the list of all terminal substrings of the haystack. (A terminal substring is one which appears at the end of the haystack. In other words, a substring you can get by taking characters off the beginning.) For example, with the haystack string "foobar", the terminal substrings would be

    foobar
    oobar
    obar
    bar
    ar
    r

Now sort this list into order, so we get:

    ar
    bar
    foobar
    obar
    oobar
    r

This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!

The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)

The setup phase consists of a sort. Sorting is well known to be possible in O(n log n) comparisons, but when the comparisons are between strings, the cost of each comparison is non-trivial as well. In the absolute worst case (every character in the haystack is identical), a comparison would take O(H) time, so the setup phase could take O(H^2 log H) in the worst case. On the other hand, any reasonably normal haystack is likely to take not much more than O(H log H) to sort.

The searching procedure is similar. A binary search takes O(log n) comparisons, so in the worst case the search can take O(N log H). Assuming the needle is reasonably small, this seems entirely acceptable for the massive gain of not having an O(H) term!

Applications

No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.

(back to algorithms index)

 

New and Faster Filters for Multiple Approximate String Matching (ResearchIndex)

AGREP, an approximate GREP ... Implementation of Sunday's Optimal Mismatch Algorithm ... of Xaa: Animationof String Searching Algorithms ...

[PDF] Mutable Strings in Java: Design, Implementation and Lightweight ...
File Format: PDF/Adobe Acrobat - View as HTML
... ment a few relaxed versions of Daniel Sunday’s QuickSearch [9], a variant of the
Boyer–Moore algorithm. ... applied them to the text-search problem ...

PDF] Fast String Searching
File Format: PDF/Adobe Acrobat - View as HTML
... Hill, NJ 07974, USA AND DANIEL SUNDAY Johns Hopkins ... SUMMARY Since the Boyer-Moore
algorithm was described ... benchmark for the practical string search literature. ...

[PDF] Algorithms for String Searching On A Beowulf Cluster
File Format: PDF/Adobe Acrobat
... Daniel Vaz and Hugues Treiber need a very special mention for ... Baker’s Boyer
Moore-type Algorithm [37] • Sunday’s Substring Search Algorithm [38] 2.5 ...

Recommended Links


In case of broken links please try to use Google search. If you find the page please notify us about new location
Google     

**** Pattern Matching Pointers -- very good

NIST String matching

Dick Grune's Annotated Literature Lists

string matching algorithms

The Prague Stringology Club

Algebraic Combinatorics on Words

EXACT STRING MATCHING ALGORITHMS

[PDF] COSC 348 String matching

Introduction to PL-I, Algorithms, and Structured Programming, 3rd Edition

Teaching Binary Tree Algorithms by Programming Proof and Animation

Quick string search

Quick Search algorithm

Horspool algorithm

Citations- Improved string searching - Baeza-Yates (ResearchIndex)

Lecture Notes

String Searching

 

 

 

Knuth-Morris-Pratt algorithm

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt algorithm - Encyclopedia Fablis articles about ...

Knuth-Morris-Pratt algorithm - Wikipedia, the free encyclopedia

NIST: Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt algorithm - encyclopedia article about Knuth ...

Knuth-Morris-Pratt Algorithm

Boyer-Moore string search algorithm

 

String searching algorithm - Wikipedia, the free encyclopedia

Boyer-Moore String Matching over Ziv-Lempel Compressed Text ...

The Boyer-Moore Fast String Searching Algorithm

This algorithm, which Bob Boyer and I invented in about 1975, is the basis of the fastest known ways to find one string of characters in another. How might you look for the following pattern in the text below?
pattern: EXAMPLE
   text: HERE IS A SIMPLE EXAMPLE
We can show you the Knuth-Pratt-Morris algorithm and we can show you the Boyer-Moore algorithm.

Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is ``sublinear'' in the sense that it generally looks at fewer characters than it passes. The algorithm is described in

Frodo's Workshop - Boyer-Moore Algorithm

NIST: Boyer-Moore

The Boyer-Moore Algorithm

Data Structures and Algorithms II Lecture 3- String Search ...

Horspool algorithm

Boyer-Moore-Horspool String Searching

Sunday String Search algorithm

sunday

SUNDAY QUICKSEARCH ALGORITHM
By Leonard Morgenstern nleonard@aol.com

 Forth Scientific Library Algorithm #41

Adapted from Daniel Sunday. Communications of the ACM 33:132 1990.

QUICKSEARCH ( c-addrp up c-addrt  ut -- c-addr)
Locate the first instance of a search-pattern with coordinates c-addrp up within a
text-string c-addrt ut. Return the address of the first match in the text if found,
or NIL if not, where NIL is a user-defined impossible address. An ambiguous
 condition exists if the search-pattern (up) is longer than the text-string (ut).

The search algorithm

Before beginning the search, the algorithm constructs an array called a
 shift-table, having the same number of cells as the number of possible characters,
 normally 256. The contained quantity is up +1 (where up is the length of the
 search-pattern) for characters not present in the search-pattern, and the number
 of characters from the end of the string for characters that are present, the
 final character in the string being counted as 1. If duplicates are present, the
 lowest number is used. For example, if the pattern string is "THAT", then T=1,
 A=2, H=3, and all others are 5. If two characters are equivalent, they are
 assigned the same shift-values; in the example, T and t would both have the value
 1 if case is to be ignored. The shift-table is the only data structure needed by
 the algorithm, except, of course, the text and pattern strings themselves.

 At the start, the first character of the pattern is positioned on the first
 character of the text and up characters are compared. (The choice of a comparison
 algorithm will be discussed below.) If there is a mismatch, the text character
 just beyond the pattern is examined, and the corresponding shift obtained from the
 shift-table. The pattern is then advanced that number of characters. The process
 is repeated until the pattern is matched or the text-string is exhausted.

An important characteristic of Sunday's algorithm is that the distance advanced
 depends only on the text character just beyond the search-pattern; it does not
 depend on the character that failed to match, its position, or the order in which
 the pattern and text are compared. The algorithm may search front-to-back,
 back-to-front, or a scrambled order based on letter-frequencies. Naturally, the
 choice affects performance, as will be discussed in the next section.

The comparison algorithm

The algorIthm for comparing the search-pattern and text (the "comparator") is a
 deferred word with the stack diagram:

COMPARATOR ( c-addrp up c-addrt ut -- n)
where n=0 for a match and non-zero for a mismatch. The stack diagram is the same
 as that of the word COMPARE in the ANSI Standard String Word Set.

COMPARATOR should be carefully selected to optimize performance. The standard word
 COMPARE is a good choice in most cases. Situations where another comparator should
 be used include: 1) when certain sets of characters are regarded as equivalent,
 for example, when case is to be ignored; 2) where wild-cards are allowed; and 3)
 where character frequencies dictate a more efficient search order. In general, the
 search is fastest when rare characters are matched first.

Efficiency

The efficiency of the search can be measured by the number of comparisons. The most
 favorable case is seen when the first character in the pattern is never present in
 the text and the text character just beyond the pattern never happens to be
 present in the pattern; the number of comparisons is then ut/(up+1). At the other
 extreme, there are degenerate cases in which the number of comparisons approaches
 ut*up. This would occur when using a front-to-back comparator if the text and
 search patterns contain long runs of the final character in the search pattern,
 for example, searching for AAAABA in AAA...AAAAABA. This is not merely a
 theoretical problem, as it might occur in searching for a string embedded in
 spaces, or in searching graphics files. In the example, performance would be
 improved by matching the odd character first, but the number of comparisons could
 not be reduced much below nt. If there is a significant chance that patterns like
 these might occur, Sunday's algorithm probably should not be used.

Requirements
This is an ANS Forth program requiring:
1. A standard system with Core and Core extension sets.
2. String extension set if COMPARE is used to compare strings.
                3. Uses the words V: and DEFINES from the FSL utilities to
     manage vectored words
                4. The compilation of the test code is controlled by the
                     VALUE TEST-CODE? and the conditional compilation words in
                     the Programming-Tools wordset
                5. The test code uses the String wordset word /STRING
                6. The test code uses the (non ANS) words:
                        TAB to emit a tab character and,
                        DUP>R which is equivalent to:
                        : DUP>R   DUP >R ;
                       
    
  This code is released to the public domain. Leonard Morgenstern, March 1995

[THEN]

\ F-PC IMPLEMENTATION OF SUNDAY QUICKSEARCH

0 [IF]

\ PRELUDE to convert F-PC to ANSI
: CELLS 2* ;
: CELL+ 2+ ;
: VALUE 0 VALUE ;
' =: ALIAS TO           IMMEDIATE
' ASCII ALIAS [CHAR]    IMMEDIATE
' " ALIAS S"            IMMEDIATE
: COMPARE ( c-addr1 u1 c-addr2 u2 -- n)
        rot min COMP ;
\ END OF PRELUDE

INCLUDE QUIKSRCH

[THEN]

1 CELLS CONSTANT CELL

\ SUNDAY QUICKSEARCH ALGORITHM
\ ANSI STANDARD
\ REQUIRES CORE, CORE EXTENSION, AND STRING EXTENSION

\
\ SETTING PARAMETERS
\

-1 CONSTANT NIL
\ Impossible address; returned if no match found

256 CONSTANT #SYMBOLS
\ Number of possible symbols

CREATE SHIFT-TABLE #SYMBOLS CELLS ALLOT \ Make shift-table
V: COMPARATOR
' COMPARE DEFINES COMPARATOR   \ Default COMPARATOR.

\
\ Initializing the shift table.
\
: INIT-SHIFT-TABLE ( c-addr1 u -- )
   
   DUP 1+
       SHIFT-TABLE #SYMBOLS CELLS OVER + SWAP
       DO      DUP I ! CELL +LOOP
       DROP
       DUP 0 2SWAP OVER + SWAP
       DO
               I C@ CELLS SHIFT-TABLE +
               >R 2DUP - R> !
               1+
       LOOP
       2DROP
;
\
\ QUICKSEARCH
\

\ In this version, certain quantities are stored in VALUEs
\ For many applications, local variables would be better
0 VALUE QS-TA     0 VALUE QS-TL
0 VALUE QS-PA     0 VALUE QS-PL

: QUICKSEARCH ( c-addr[p] u[p] c-addr[t] u[t] -- c-addr[m])
\ Enter with addr & len of pattern & text
\ Return address of match or NIL
       TO QS-TL TO QS-TA \ Store text coordinates
       2DUP INIT-SHIFT-TABLE \ Initialize shift-table
       TO QS-PL TO QS-PA       \ Store pattern coordinates
       NIL \ NIL on stack
       QS-TA QS-TL QS-PL - 1+  \ Maximum number of comparisons
       OVER + SWAP \ Substitute BOUNDS if defined
       DO
               I QS-PL QS-PA QS-PL
               COMPARATOR 0= \ Make a comparison
               IF
                       DROP I LEAVE
\ Leave if found
               THEN
               I QS-PL + C@ \ Extract char just past pattern
               CELLS SHIFT-TABLE + @
\ Extract shift
       +LOOP
;

\ END OF ALGORITHM



\ BELOW HERE ARE SEVERAL TEST UTILITIES, ETC.
\ =======================================================================

TEST-CODE? [IF]     \ test code =============================================

: DUMP-ST ( DUMP SHIFT TABLE)
       #SYMBOLS 0
       DO
          I 16 MOD 0= IF CR  I EMIT SPACE THEN
          I CELLS SHIFT-TABLE + @ . SPACE
          I 128 MOD 0= IF KEY DROP THEN
       LOOP
;

CREATE pattern$ 40 CHARS ALLOT
CREATE text$ 40 CHARS ALLOT

: Show-result ( a -- ) \ After performing search show result
    CR 1 tab
    DUP nil =
    IF
        DROP ." NO MATCH!"
    ELSE
        >R              \           \ stash match addr on r.s.
        QS-TA QS-TL     \ ta tl     \ get coordinates of text
        OVER R> OVER -  \ ta tl ta len1   \ get coordx of 1st part
        dup>r
        TYPE [CHAR] ^ EMIT \ ta tl
        R> /STRING      \ a' l'     \ index to match
        OVER QS-PL
        TYPE [CHAR] ^ EMIT \ a' l'  \ type matching part
        QS-PL /STRING TYPE  \       \ type tail
    THEN
    CR
;
: QTEST ( -- a )  \ Ask for strings, compare, & return result
    CR ." Look for: "
    pattern$ 40 EXPECT pattern$ SPAN @
    CR ."       in: "
    text$    40 EXPECT text$ SPAN @
    QUICKSEARCH
    Show-result
;

[THEN]
 

Dictionary of Algorithms and Data Structures ... [Sund98] Daniel M. Sunday, A Very Fast Substring Search Algorithm, Communications of the ACM, 33(8):132-142, August 1998. [Vitt01 ...
www.nist.gov/dads/ - 82k - Nov 13, 2004


stbm.c ... 10, October 1977 pp. 762-772 */ /* * A Very Fast Substring Search Algorithm" */ /*
Daniel M. Sunday */ /* Communications of the ACM */ /* Vol. 33 No. ...  www.grouse.com.au/ggrep/stbm.c.html - 33k

[PDF] COSC 348 String matching File Format: PDF/Adobe Acrobat - View as HTML ... MR 89g:68021 [Sun90] Daniel Sunday, A very fast substring search algorithm, Communications of the Association for Computing Machinery 33 (1990), no. ... Volume 33 ,  Issue 8  (August 1990) table of contents
Pages: 132 - 142   Year of Publication: 1990 ISSN:0001-0782 Daniel M. Sunday  East Stroudsburg Univ. of Pennsylvania, East Stroudsburg, PA Publisher: ACM Press   New York, NY, USA

This article describes a substring search algorithm that is faster than the Boyer-Moore algorithm. This algorithm does not depend on scanning the pattern string in any particular order. Three variations of the algorithm are given that use three different pattern scan orders. These include: (1) a “Quick Search” algorithm; (2) a “Maximal Shift” and (3) an “Optimal Mismatch” algorithm.

Etc

A Fast String Scanning Algorithm with Small Startup Overhead Thomas Wang, October 1999

 
last update November 1999

Abstract

A fast substring search algorithm is presented here. This algorithm is faster than a simple left to right scanning algorithm. Startup overhead is taken into consideration to avoid the use of a large lookup table. This algorithm is easily implementable in Java.

Introduction

The Java class 'String' provides substring searching capabilities. The method String.indexOf(pattern_string) will search for the location of the first occurrence of pattern string in the source string. For example, "Who is Tomas".indexOf("Tomas") would return 7.

Typical implementation algorithm would search for the character 'T' from left to right. Once 'T' is found, then 'o', 'm', 'a', and 's' would be matched. If a mismatch happens, then we have to resume searching for 'T' until we hit the end of the search string.

This is pseudo code implementing the simple algorithm.

outer_loop:
for indx = 0 to length(source) - length(pattern)
{
  // match first pattern character
  if pattern[0] == source[indx] then
  {
    for indx2 = 1 to length(pattern) - 1
    {
      // match subsequent pattern characters
      if pattern[indx2] != source[indx + indx2] then
        continue outer_loop;
    }
    return indx; // everything matched
  }
}
return did_not_match;

We can do much better than the above algorithm.

Looking For Impossible Characters

The key to a faster scanning algorithm is to look at the last character of the pattern string first. If the last pattern character matches, then we continue to search for pattern characters from left to right, until the second last pattern character. (remember the last pattern character already matches)

"Who is Tomas".indexOf("Tomas")

    v
Who i........         <- source
Tomas                 <- pattern
    ^

In the above example, we found a character 'i' where a character 's' is required to make a match. Because 'i' does not occur anywhere in the string 'Tomas', we can advance the search position 5 characters! While we are matching the last character of the pattern string, if we found an "impossible character" in the search string, we are safe to advance the search pointer by the length of the pattern string.

How do we program this test? We can use a 64 bit long integer for a quick test.

// initialization
long cache = 0L;
for i = 0 to length(pattern) - 1 {
  cache |= 1L << (pattern[i] & 63);
}

...

// test for last pattern character; pattern string is "Tomas"
if (source[i] != 's') { // last character did not match
  if ((cache & (1L << (source[i] & 63))) == 0L) { // got an impossible char
    i += 5; // skip 5 characters
  }
}

Being able to skip 5 characters instead of 1 character at a time gives us a speed boost.

If the implementation language does not have good support for 64 bit integer arithmetic, a 32 bit integer cache can be used in a pinch.

Search Order Discussion

After the last character of the pattern string matches with the source string character under the scan pointer, we scan the rest of the pattern string from left to right. The question is 'why'? Some other choices include right to left, and random order. The answer has to do with worst case performance estimate.

If we find an "impossible character" in the subsequent pattern string matches, we can still skip more than one characters. For example, if we find the fifth character is "impossible", then we can increase search pointer by 5.

stre stress     <- source
stress          <- pattern

234561          <- match order


    v
stre stress     <- source
stress          <- pattern
    ^

By scanning from left to right, we ensure the amount we skip is roughly proportional to the number of characters we scanned. If we scan from right to left, then it is possible for a long scan to yield a small skip amount.

MD2 Optimization

Optimization of md2 offset is used when the last character of the pattern string matches, but some other pattern characters do not match.

Take the example of "James Tomas".indexOf("Tomas")

In this example, the 5th character of the search string is 's', so it matches the last character of the pattern string. We then match 'J' against 'T', which does not match. At this point, we can again skip 5 characters! Why?

This is because in the string "Tomas", 's' only occurs as the last character and no where else. We know the 5th character is 's' but the front part did not match. If we advance the search position one character, then the 's' character would be eventually matched against the 4th character of the pattern string ('a'). But we already said 's' does not occur in the pattern string from the first position up to the 4th position.

    v
James....       <- source
 Tomas          <- pattern
    ^

So before we do anything we can already predict any shift less than 5 would produce an ultimate mismatch.

The safe shift distance for subsequent mismatches is called 'md2'. One can calculate md2 by looking for 's' starting from the 2nd last pattern character to the first pattern character

For the pattern string "Tomss", md2 would be 1.
For the pattern string "Tosas", md2 would be 2.
For the pattern string "Tsmas", md2 would be 3.
For the pattern string "somas", md2 would be 4.

Pseudo Code for Full Implementation

// initialize the cache
long cache = 0L;
for i = 0 to length(pattern) - 1
{
  cache |= 1L << (pattern[i] & 63);
}

// calculate md2
last_pattern = pattern[ length(pattern) - 1 ];
md2 = length(pattern)
for i = 0 to length(pattern) - 1
{
  if last_pattern == pattern[i] then
  {
    md2 = length(pattern) - i;
  }
}

scan_loop:
for i = length(pattern) - 1 to length(source) - 1
{
  if last_pattern == source[i] then
  {
    // last character matched, match the rest

    for i2 = 0 to length(pattern) - 2
    {
      if pattern[i2] != source[i - length(pattern) + 1] then
      {
        // see if character under cursor is "impossible".
        if ((cache & (1L << (source[i] & 63))) == 0L)
          altskip = i2 + 1;
        else
          altskip = 1;
        // skip the maximum of md2 and impossible calculation
        i += max(md2, altskip);
        continue scan_loop;
      }
    }
    return i - length(pattern) + 1; // everything matched
  }
  if ((cache & (1L << (source[i] & 63))) == 0L)
    i += length(pattern); // got an impossible character
  else
    i += 1;
}
return not_found;

Performance Estimate

The new string searching mechanism can potentially speed up string searching by 'p' times, where p is the length of the pattern string. The down side would be that there is pre-processing cost of 2p instructions, calculating the impossible bitmap, and md2. When we discovered two instances of impossible characters in the search string, then we should be even in processing time.

This performance profile compares favorably with traditional fast substring scanning algorithms, such as Boyer Moore. Since Java's character is 16 bit wide, a naive implementation of Boyer Moore would require 65536 machine instructions to initialize its lookup table. This is obviously not acceptable when we are scanning any string shorter than 70000 characters long.

References

Timo Raita, "On Guards and Symbol Dependencies in Substring Search", Software-Practice and Experience 29(11), pg 931-941 (1999)

Daniel M. Sunday, "A Very Fast Substring Search Algorithm", CACM 33(8), pg 133-142 (1990)

R. Boyer and S. Moore, "A Fast String Searching Algorithm", CACM

 

Rabin-Karp string search algorithm

 


Copyright © 1996-2008 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: November 08, 2008