CsvSQL

Manipulate CSV files using an SQL Interface

Killian Faughnan

The motivation behind the writing of this software was a simple one. I have often used CSV (Common Separated Value) files for work purposes, and consistantly found grep and egrep to be lacking in the ability to retrieve the information I required. Conversly I found awk to be too much. I'm sure there are many staunch awk advocates who would find that a blasphemous statement, however, while awk is perfect for many tasks, in this case I just found it to be excessive.

To this effect I started work on CsvSQL. I endeavoured to create a simple way to access specific entries in large files, based on any associative fields they might have, and to do so in a simple manner. To achieve this I decided upon using a subset of SQL as the command syntax structure, as this would already be familiar within the target user group of Systems Administrators.

CsvSQL is designed to read the contents of a CSV file into an organised hash (or associative array) of arrays. This allows it to correctly manipulate the data without corrupting it, before writing it back out disk.

In essence CsvSQL is a data manipulation program which can accurately manipulate CSV files without corrupting the data within. It is a more optimised method by which to manipulate data in a CSV file than traditional, more unwieldy tools.

A copy of the source code for this project can be found at http://www.killianfaughnan.com along with a pdf and online version of this documentation. The source for this project will also be available from CPAN in the near future.

 

If you optimise everything, you will always be unhappy.

 
--Donald Knuth 


Table of Contents
Acknowledgements
1. Introduction
1.1. Overview
1.2. Key Aims of the Application
1.3. Overview of this Document
1.4. Target User Group
2. Background
2.1. Motivation
2.2. Common Separated Value File
2.2.1. CSV Specification
2.2.2. What can CSV files be used for?
2.3. Alternatives to CsvSQL - Microsoft Import-Csv
3. Systems Analysis and Design
3.1. Program Design
3.2. SQL Subset Defined by EBNF
3.3. Analysis of a CsvSQL Command
4. Implementation
4.1. The Perl Language
4.2. Implemented SQL Queries
4.2.1. SELECT Syntax
4.2.2. INSERT Syntax
4.2.3. UPDATE Syntax
4.2.4. DELETE Syntax
4.2.5. CREATE Syntax
4.3. Algorithms
4.3.1. Populating a Hash of Arrays
4.3.2. Accessing a Hash of Arrays
4.3.3. Accessing a Hash of Arrays with a LIMIT
4.3.4. Accessing a Hash of Arrays by Column Name
4.3.5. Restricting on a WHERE Clause
4.3.6. Determining Validity of Requested Columns
4.3.7. Single Search Algorithm
4.3.8. Multiple Search Algorithm
4.4. Regular Expressions
4.5. Difficulties
5. Project Management
5.1. Subversion Source Control
5.2. DocBook
5.3. Packaging
6. Testing
6.1. Introduction
6.2. User Input Validation Testing
6.3. Executed Command Results Testing
6.4. Performance Analysis
7. Conclusions
7.1. Future Development
7.2. Summary
References
Glossary
A. Man Page
List of Figures
3-1. CsvSQL Command Data Flow Diagram
List of Examples
2-1. Sample CSV
2-2. Sample MS Import-Csv
2-3. Sample MS Import-Csv with And
4-1. Sample SELECT
4-2. Sample INSERT
4-3. Sample UPDATE
4-4. Sample UPDATE
4-5. Sample CREATE
4-6. WHERE Statement Regular Expression
1. Awk Example