I decided to use a hash of arrays as the data structure to hold the CSV data. The reason for this was that using a hash (or associative array) allowed me to use column names as keys, with each column's data then being accessed via the array which is that keys value.
To populate the hash of arrays CsvSQL will call the readfile() subprocedure which will perform all the tasks necessary to read in a file and place it's contents into the correct fields of the hash of arrays. readfile() will first check to see if the file to be opened is already in memory by checking the current file name and comparing it against the proposed file.
Once CsvSQL has confirmed the file is not already open it will attempt to open and lock the file. Should it succeed it will initially only process the first line. This first line is split based on either the default separator of "," or a user defined separator, and each value is put into a temporary array. If either the file open or lock fails the user will be informed.
readfile() will now iterate through the temporary array placing each value into the keys of the hash that will hold the arrays. As each key is in turn assigned an element, an empty array will be initialised as it's value to contain the rest of the data. As hashes do not maintain the order in which their elements were inserted it is necessary to create an @keys array which will maintain the order of the hashes elements. From this step onwards all iteration through the hash is done through the @keys array, for ordering purposes.
At this point the readfile() subprocedure will begin reading the rest of the file one line at a time. Each line is once again split into an array, with each value inserted into one of the arrays in the hash. To do this readfile() will take the first element from the temporary array, and insert it into the element at index 0 of the array that is the element of the hash which has the same value as the first element of the @keys array. A predefined counter will now be incremented, and then the process will continue by inserting the next element in the temporary array into the element at index 1 of the relevant hash element, as defined this time by the second element in the @keys array, and so on.
There are a few items of this process that will be important during the documentation of accessing hashes of arrays. During this process an array called @size is created which has the current line number appended to it as each line of the file is processed. It is also important to note that throughout this process it is necessary to strip the various end of line characters used by differend operating systems. To do this CsvSQL strips the character values \012 and \015 from the end of each line as it processes it.
Accessing a defined hash of arrays can be quite problematic, but has certain similarities to populating a hash. However once we start restricting what columns we want to access we are faced with a new problem. As there are three methods of restriction in CsvSQL there is a different approach for each method. In the following sections we will describe the differences, however for the basics of accessing a hash we will use a case of the user wanting to do a simple SELECT command such as SELECT * FROM data.csv.
In this case the selectdata() subroutine is called. The select subroutine will take the in the arguments of * columns from the file data.csv and determine that the user wants all columns from the file. In this instance it is a simple case of iterating through the @keys array and iterating through each array in turn one index at a time for length N, where N is the last element in @size, and displaying the results.
In order to acess a hash of arrays by restricting on a LIMIT parameter we will need to take a user defined value and create an array ( @limit ) populated with the first N elements of the array @size where N is the user entered value and @size is the array containing line numbers. Once this it done we will just display the results using the values in @limit as opposed to @size which contains the full listing. If the specified limit is greater than the file size the entire listing is printed.
Restricting a hash by a colum name requires the user to enter the desired column's names. Once entered these values are placed into an array ( @cols ). The first thing to do is to check that the supplied values are valid column names. To do so CsvSQL passes the @cols and @keys arrays through to a different algorithm called issubsetof() which will check that every element in @cols exists in @keys.
Once the columns are valid then CsvSQL will iterate through the @cols array and it's associated arrays as opposed to the complete set of values contained in @keys and thus only return the values with from the column name(s) specified.
The WHERE clause is just another form of a restrict, though more complex than those above. A WHERE can be used with either a SELECT, UPDATE or DELETE, and there are three kinds of WHERE clause. The first is a simple single WHERE, the second an AND, and the third an OR.
A single WHERE clause will search for any fields in a single column matching a user defined value. Both the column name and value are to be entered by the user. To first validate that the column exists CsvSQL will call the iselementof() subprocedure which will take a column name, and the @keys array as it's parameters. If the column name exists in the @keys array then it is a valid column and the process can continue.
Unfortunately due to the nature of the data, and the necessity of it being ordered in a consistant manner, CsvSQL cannot sort the data and must run a linear search which will run in O(N) time. It is also necessary to search for all occurances of a value, so the search algorithm used, which for this instance of WHERE would be the singlesearch() subroutine, will continue on to the end of the target array in order to ensure we find all instances of a value. singlesearch() will then return the list of indexes which matched the value being searched for, so CsvSQL will only display those elements which are in the correct indexes of the columns the user is looking for.
If a WHERE with an AND is required CsvSQL will pass the column and value paramaters for each side of the AND through to the multisearch() algorithm. The multisearch() will search through the first column defined by the user input for the values required. These values indices are then recorded in an array, which the multisearch will use as a pre-restricted list through which to search the next user defined column for a particular value. The end result is an array that contains the index values of all "rows" in the hash of arrays where both conditions are met.
If a user requires a WHERE clause with an OR it is necessary to slightly change the approach. In this case CsvSQL will pass the parameters on either side of the OR one after the other. This would result in singlesearch() being called once for the left side of the OR, and once for the right side, with the results of each query stored in in a separate array. Once both searches are done the results of the second search array are appended to the first. CsvSQL will then strip out the duplicates leaving a single array containing the index values of those elements to be displayed.
In order to determine if columns requested by a user as part of the SELECT statement's column list are valid columns it is necessary to ensure that the columns exist. To do so CsvSQL will call the issubsetof() subprocedure. This subprocedure takes two parameters, the first an array containing the columns required, and the second the @keys array which contains the list of all columns. If each element in the array containing required column names is contained in the @keys array then the subprocedure will return true and the column list is valid, otherwise it will return false.
The iselementof() subprocedure works in a similar fashion, albeit a simpler form. This subroutine takes just a single scalar value, and a single array and performs much the same actions, and as such will not be documented here.
In order to search for an element CsvSQL will use the singlesearch() subprocedure. As CsvSQL can use not only the = operator, but also <=, >=, != <, and >, the search needs to take this into account. A singlesearch() will take as it's parameters a column name, an element name, and an operator. The operator will determine in what fashion the search will be conducted. The search itself will iterate through the array indicated by the column name passed to it. singlesearch() will return an array of indices where it has found a match. This search function will run in O(N) time. This is unavoidable as the data needs to be kept in the same order as that in which it was read, and any sort procedure that might be run on the array would have to be undone before the program could continue making void any temporary improvements in speed.
To search for multiple items where an AND parameter has been passed CsvSQL will call the multisearch() subprocedure. multisearch() will take as it's parameters the column names, values, and operators of both sides of the AND parameter. multisearch() will first search for one set of values, much like singlesearch does, and then perform the same actions on the second group of parameters, returning the combined results of the two searches as an array.