String (character sequence) matching algorithms (also known as string search) are one of the core class of algorithms in computer science. Works on this category of algorithms started in the early 1950s. The main goal of a string matching algorithm is to find whether a small string belongs to a larger string, as fast as possible. It's like a game that you can play with your friends - finding a sentence in a large book, and of course, smarter ones win. You could apply different techniques. You could act lazy and try matching each character of the sentence that you're given, right from the beginning of the book (in computer science, this is called, a brute force method). Or you could act intelligently by guessing the pages in the book that might match the string by its context, thus reducing the search space. The way you attach will diversify the strategy you choose. Here's a timeline that plots some major milestones on this topic.
In case you're wondering why string matching is a big deal, here are some useful applications - spell checkers, spam filters, malicious code detection, content searching in large databases (like search engines), information retrieval, plagiarism detection, DNA sequencing and so on. The list don't stop here. We will see more examples as we progress.
pattern matching #
We've played an exciting string matching game in the previous section. But it belongs to a specific category called
exact string matching. What we're going to do now is, to generalize these matching algorithms, and call it pattern matching. Through pattern matching, we will be searching for a sequence (or a set of sequences) of tokens that having specific properties. For example (extremely vague), suppose Jubin Jose and William Shakespeare are decided to write a single page drama on the same topic. After writing, you will be invited to read both of them and figure out which drama being written by Shakespeare. It wouldn't be a big challenge for you to figure it out. But how you did it? Aha.. Your gut told you so? Yeah, that gut feeling's pattern matching. And a token can be anything very basic, that makes a sequence - a sequence of tokens; just like characters makes words, words makes sentances, sentances makes paragraph and Nucleotides makes DNA.
As I have mentioned, this is a very vast topic and out of context of our blog series. We will be moving through a narrowed road inside this widespread concept. You will recognize whenever we enter its territory because you've got introduced to it.
And I would like to mention that, both Machine and Human, love patterns.
regular expressions #
You might say: "That ended very quickly. We haven't touched anything specifically. It's disappointing.." I know it was rude. Good news is, we're moving towards it and will reach it gradually. For now, we're going to talk about regex. Let this be the first step towards it.
Regular expressions (Regex) are a sequence of characters that define a search pattern. Regex searching is a pattern matching technique usually applied to a sequence of characters (string). It is used to
classify strings based on some rules. A regex processor translates a regular expression into an internal representation which can be executed and matched against a string. In the upcoming article, We will learn to write basic regex rules and will apply it over some text. And in the final article, we have something exciting to play with. Have you heard of ELIZA chatbot? What about building one? Till then.