Throughout this project regular expressions have formed the foundation of user input validation. A regular expression is a pattern with which a programmer can specify the exact syntax of valid input for a particular portion of a program. Perl's regular expression capabilities are amongst the best of any programming language out there, due in part to it's focus on text processing. By using these regular expressions CsvSQL can validate each user inputted command and ensure that it meets the particular criteria associated with that command before passing it on to the subroutine that will use the values in a command to manipulate the actual file data.
In fact, not only do the regular expressions used here check the syntax of user input, they also pick out the actual conditions and expressions that will be used by various subroutines, and assign these expressions to variable names that will later be passed to these subroutines. This is done by capturing expressions which fall inside parenthesis and assigning their value to Perl defined variables. These captured expressions will form the parameters for CsvSQL's subroutines. The grouping methodology will be explained in more detail later on.
The only way to completely explain a regular expression is with an example. As such, below is an example of one of the four regular expressions used to check the syntax of the SELECT statement before the command is allowed execute.
Example 4-6. WHERE Statement Regular Expression
m/\s*((?:.+?)|"(?:.+?)")\s+where\s+(.+?)\s*([<>!]=|[<>=])\s*(.+?)\s+
(and|or)\s+(.+?)\s*([<>!]=|[<>=])\s*(.+?)$/i
We will start off with the beginning of the statement which is m/\s*
The m in the above line specifies that Perl should match the following statement. There are other variations of this, many of which are used in this program. For instance, an s would tell Perl that we want to substitute the first portion of a regular expression with a second part. The / is the starting delimiter indicating that what follows is a regular expression.
The \s means that we should expect a space, and the * means zero or more of what precedes it. So in this case at the start of the string can contain zero or more spaces.
We will take the first half of the second chunk of this regular expression which is (?:.+?). Note that both the first and second half of this expression are inside parenthesis, which we will explain later.
This first portion of the regular expression can then be broken down into the surrounding parenthesis (...), the ?: , . , + , and ?.
The (...) represents a group of expressions. Groups of expressions can be accessed using default variable names assigned to them by Perl. These variable names are $1, $2, $3 and so on. The ?: however marks this first group as a passive group which means it won't be captured in a variable.
The . represents any character except for the newline character (\n). The + means that we are expecting one or more of the preceeding character, or group, which in this case is the . character. Finally the ? after a quantifier () signifies that we want a minimal match, to match the minimum possible amount of characters.
So in total we are looking for a group of any characters except the newline character, with at least one character value, with a minimal match so as to match the minimum number of possible matches.
The second part of this chunk is "(?:.+?)"
This regular expression represents the same as the first part of the chunk, except that it will include "'s on either side of the character(s). In which case we will be searching for the same expression as before, except surrounded by "'s.
Both of the last two expressions are encaplulated in the one group denoted by the (...) surrounding the expression. As this group does not include a ?: it will capture the expression within the parenthesis and assign it to the $1 variable. The | symbol in the middle of the group denotes that we require either the expression on the left side of the |, or the expression on the right, but not both. The difference here being that the right side expression is surrounded by "'s whereas the left side is not.
The next portion of the expression is \s+where\s+. As we already know \s represents a space character, and + represents at least one or more space characters. The word where is a literal match looking for a string which matches exactly. The \s* means we will allow 0 or more space characters.
The (.+?) means that we require one or more characters (as long as it is not a newline character). This expression will then be assigned to the $2 variable.
The \s* once again means 0 or more spaces. The following expression, ([<>!]=|[<>=]) is a character set. We will first split this into it's two constituent parts found on either side of the | symbol. Where | means we can match one part or the other, but not both. The [...] means that we are required to match any one of these characters, but not more than one. In the first [...] we have [<>!] followed by an =. This means we will attempt to match either <=,>= or !=. The = is common to all matches as it is placed outside the [], whereas out of the other symbols only one can match at a time to return a match, yet the expression must match at least one of the symbols.
The right hand side of the | is slightly simpler. In this case we have only three possible choices with no common elemtents. If we can mateh either < , > , or = we have a match. This entire segment is inside parenthesis so whichever side of the expressions is a successful match will be assigned to the $3 variable.
The \s*(.+?)\s+, as explained earlier simply means that we are looking for 0 or more spaces, followed by one or more of any character except the newline character, followed by one or more spaces. The matched expression will be assigned to the $4 variable.
At this point we have completed the validation for what would be the file_name WHERE col_name = expr portion of a CsvSQL SELECT command. The next portion of our regular expression is a simple literal group represented by (and|or). This expression will match either an and or an or string, and assign the matched expression to the $5 variable.
The rest of the expressions we have seen before. The \s+(.?)\s*([<>!]=|[<>=])\s*(.+?) expression will attempt to match one or more space characters, followed by one or more non-newline characters ( the result of which will be assigned to $6), followed by 0 or more space characters. It will then attempt to match either <= , >= , != , < , > , or =. Whichever of these is a match will be assigned to the $7 variable. The expression will then attempt to match 0 or more space characters, followed once again by one or more non-newline characters. This last group will be assigned to $8
The final part of this regular expression is $/i. The $ holds a special status in regular expressions. It signifies the end of a string. In this case it is saying that if the string does not end with the last group of characters it is invalid. The / is the delimiter showing that we are at the end of a regular expression. The trailing i is known as a pattern modifier. This particular modifier makes the preceeding regular expression look for a case insensitive match. This allows users to enter either SELECT or select as is their preference, and CsvSQL will accept either version.