CSVMODE User's Guide

Author: Manuel Collado, <m-collado@users.sourceforge.net>, http://mcollado.z15.es
CSVMODE Version 0.2a, January 2021
Public domain

Contents

CSVMODE Tutorial
Usage recommendations
CSVMODE Reference

CSVMODE Tutorial

The CSV data format

The Comma-Separated-Values (CSV) data format is commonly used by spreadsheets and database engines to import and export data as plain text files.

A CSV file is a sequence of records separated by newline marks. A CSV record is a sequence of fields separated by commas. A field can contain almost any text. If a field contains commas, newlines or double quotes it must be enclosed in double quotes. Double quotes inside a field must be escaped by doubling them. Example:

author,title,remarks
Shakespeare,A Midsummer Night's Dream,comedy
"Stevenson, Robert Louis",Treasure Island,novel
anonymous,"A ""quoted"" word","remark 1
remark 2"

There are four records, each one with three fields. The field "Stevenson, Robert Louis" is quoted because it contains a comma. The field "A ""quoted"" word" is quoted because it contains escaped quotes (coded as duplicates). The third field of the last record has two lines of text. The data is equivalent to the following table:

author	title	remarks
Shakespeare	A Midsummer Night's Dream	comedy
Stevenson, Robert Louis	Treasure Island	novel
anonymous	A "quoted" word	remark 1 remark 2

Installing the CSVMODE library

Install the distribution archive

Download and unpack the csvmode-x.x.zip (or csvmode-x.x.tgz) distribution archive in a directory of your choice. And either add the csvmode subdirectory to your AWKPATH environment variable, or copy the csvmode*.awk files to a directory already in AWKPATH.

You can see what is your current effective AWKPATH by issuing the command:

gawk 'BEGIN {print ENVIRON["AWKPATH"]}'

On Windows you should type:

gawk "BEGIN {print ENVIRON[\"AWKPATH\"]}"

Prerequisites

The CSVMODE library requires GNU awk (gawk) version 4.1.0 or later. You can see what is your current gawk version by issuing the command:

gawk --version

Using the CSVMODE library

The CSVMODE library provides facilities for:

Process strings containing individual CSV records or fields.
Process whole CSV data files.
Generate CSV formatted records or fields.

The CSVMODE library must be explicitly included with either a -i csvmode option in the command line or a @include "csvmode" directive in the awk script code.

Handling individual CSV fields

A couple of functions can convert between CSV fragments and clean text fragments.

csvquote: <b,"c> --> <"b,""c">
csvunquote: <"b,""c"> --> <b,"c>

Parsing individual CSV records

The csvsplit() function divides a CSV formatted record string into pieces. The CSV fields are stored as elements of an array. Example:

data --> a,"b,c",d
n = csvsplit(data, af)

gives

n = 3
af[1] -> <a>
af[2] -> <"b,c">
af[3] -> <d>

It is possible to handle CSV data that use alternate delimiter or quote characters. For instance, if the record uses semicolons instead of commas to delimit fields, and single quotes instead of double quotes:

data --> a;'b;c';d
n = csvsplit(data, af, ";", "'")

gives

n = 3
af[1] -> <a>
af[2] -> <"b;c">
af[3] -> <d>

Automatic parsing of CSV files

Automatic parsing of CSV data files is controlled by a predefined CSVMODE control variable. If set to a non-zero, non-null value the input data file reader automatically recognizes CSV records and split them into fields as expected. The fields are delivered as $1, $2, ... $NF as usual. The user can choose to get CSV fragments (with CSVMODE=1) or clean text values (with CSVMODE=-1). If CSVMODE is undefined or has a null value, input files are processed in the normal way.

Sample data file:

a,b,c
p,"q,r",s
x,"""y""",z

Awk script for CSV fragments:

@include "csvmode"
BEGIN { CSVMODE = 1 }
{ print $2 }

Result:

b
"q,r"
"""y"""

Awk script for clean text fields:

@include "csvmode"
BEGIN { CSVMODE = -1 }
{ print $2 }

Result:

b
q,r
"y"

The parsing process can be customized in order to accept non-standard CSV data files. A couple of predefined variables can be used to specify the special field delimiter and quoting characters:

CSVSEP: The special character that delimit the fields. By default a comma (,).

CSVQUOTE: The specific character used to quote values. By default a double quote (").

Sample data file:

a;b;c
p;q,r;s
x;'"y"';z

Awk script:

@include "csvmode"
BEGIN { CSVMODE = 1; CSVSEP = ";"; CSVQUOTE = "'" }
{ print $2 }

Result:

b
q,r
'"y"'

The whole CSV record is stored as $0. The format depends on CSVMODE:

CSVMODE=1, keeps the original CSV input record.
CSVMODE=-1, gives a record of clean field values, separated by the SUBSEP character.

In the latter case the original CSV record is not lost, but stored in a predefined CSVRECORD variable. And the user can change the field delimiter by assigning a custom value to a CSVOFS predefined variable.

Sample data file:

a,b,c
p,"q,r",s
x,"""y""",z

Awk script:

@include "csvmode"
BEGIN { CSVMODE = -1; CSVOFS = "|" }
{ print }

Result:

a|b|c
p|q,r|s
x|"y"|z

File processing in the automatic CSVMODE correctly recognizes CSV records with multiline fields. I.e., fields that contain newline characters.

Sample data file:

a,b,c
p,"q
r",s
x,"""y""",z

Awk script:

@include "csvmode"
BEGIN { CSVMODE = -1 }
{ print "<" $2 ">" }

Result:

<b>
<q
r>
<"y">

Generating CSV data

A CSV record can be created two ways:

From an array of clean field values, indexed 1..N.
From a regular record string with fields delimited by a fixed FS-like pattern.

Examples:

f[1] = "007"
f[2] = "Bond, James"
f[3] = "United Kingdom"
csvcompose(f) -> <007,"Bond, James",United Kingdom>

record = "007/Bond, James/United Kingdom"
csvformat(record, "/")  -> <007,"Bond, James",United Kingdom>
csvformat(record, "/", ";")  -> <007;Bond, James;United Kingdom>

A convenient csvprint() function prints the current record as CSV data in any CSVMODE (1, 0, -1).

Sample data file:

a b,c "hello"

Awk script:

@include "csvmode"
{ csvprint() }

Result:

a,"b,c","""hello"""

Usage recommendations

Gawk offers three different input modes, controlled by the FS, FPATH and FIELDWIDTHS variables. The csvmode library offers two additional modes for CSV input files, controlled by CSVMODE=1 and CSVMODE=-1, respectively. It also provides a set of string handling functions to manually handle CSV formatted text.

The appropriate input mode depends on how the data will be processed. The following advises can be of help.

CSVMODE = 0

CSV data can be handled with just the functions provided by the csvmode library. In this case the csv*() functions can be seen as a first-aid facilities. This mode is suitable when most part of the code doesn't depend on the fact that some data are formated as CSV. Or when the CSV notation is used only for output.

CSVMODE = 1

This mode is suitable when the clean text field values are only occasionally needed. For instance, for selecting or reordering some columns. Or assigning new field values that doesn't depend on the value of other fields. Or just extracting some records with patterns based on the CVS notation.

This mode is not suitable if the input and output CSV notations use different delimiter or quote characters.

CSVMODE = -1

This mode is possibly the most powerful one. All computations may be done internally on clean text values. And CSV data is automatically parsed on input and generated on output, if necessary. In addition, this mode allows mixed processing of clean text and CSV files with the same set of pattern-action rules.

The only drawback is probably that some quoting on input can be lost when some field values are reproduced on output. Output CSV fields are quoted only if necessary.

Typical use cases

The usecases/ directory of the csvmode distribution contains a set of simple use cases that can help new users to start using the tool.

CSVMODE Reference

Control variables

The CSVMODE control variable must be set to a non-null value in order to enable CSV input parsing.

CSVMODE = 0  Disable CSV input parsing (default)
        = 1  Enable CSV input parsing
        = -1 Convert CSV input records into regular awk records

CSVRECORD: For CSVMODE=-1, stores the original CSV input record

The value of CSVMODE is checked at BEGINFILE time. If non null, the csvmode library takes control of FS, OFS and FPAT variables during the input file processing. These variables are automatically restored at ENDFILE time. The effect of changing these variables in the middle of the file processing in unpredictable. Changing CSVMODE in the middle of the file has no effect.

If the user wants to set CSVMODE in a BEGINFILE clause, then the relevant BEGINFILE block must be coded before the @include "csvmode" directive.

BEGINFILE {
   CSVMODE = FILENAME ~ /\.csv$/
}
@include "csvmode"

CSVMODE = 1

In this mode a suitable FPAT value is computed and applied. $0 contains the whole CSV input record and $1..$NF contain the CSV field fragments, as given by the FPAT parsing feature.

FPAT is computed according to CSVSEP and CSVQUOTE. Also OFS is temporarily set to CSVOSEP, so assigning new values to individual fields or to the whole record should work as expected.

CSVMODE = -1

In this mode the input CSV record is demangled and converted to a regular awk record. $1..$NF contain the clean values of the fields, and $0 contains the concatenation of these fields, delimited by CSVOFS. Also FS and OFS are temporarily set to CSVOFS, so assigning new values to individual fields or to the whole record should work as expected.

CSVOFS must contain a character not used in the CSV input file. The default SUBSEP character should work in almost all cases.

Optional parameters

CSV data management can be customized by a variety of optional variables. If any variable is not explicitly set, the corresponding default value is used.

CSVSEP    Input field delimiter, default comma (,)
CSVQUOTE  Input quoting character, default double quote (")
CSVTRIM = 0  Preserve all space in fields (default)
        = 1  Discard unquoted leading or trailing space
        = 2  And also collapse contiguous space into a single space
CSVONELINE  Flag, disable multiline fields (default enabled)
CSVOSEP   Output field delimiter, default CSVSEP
CSVOQUOTE Output quoting character, default CSVQUOTE
CSVOFS    Field separator for CSVMODE=-1, default SUBSEP

Public functions

CVS parsing

csvunquote() - Extract the clean value of a CSV field
  value = csvunquote(csvfield [, quote [, trim]])
    csvfield: The CSV formatted field
    quote: Optional, the actual CSV quote character, default CSVQUOTE
    trim: Optional, the whitespace trim policy, default CSVTRIM

csvsplit() - Split a CSV record into an array of CSV fields
  num_fields = csvsplit(csvrecord, afield [, comma [, quote]])
    csvrecord: The CSV formatted record
    afield: Array[1..n] of CSV formatted fields
    comma: Optional, the actual CSV field delimiter, default CSVSEP
    quote: Optional, the actual CSV quote character, default CSVQUOTE

csvparse() - Parse a CSV record and build an array of clean value fields
  num_fields = csvparse(csvrecord, afield [, comma [, quote [, trim]]])
    csvrecord: The CSV formatted record
    afield: Array[1..n] of clean value fields
    comma: Optional, the actual CSV field delimiter, default CSVSEP
    quote: Optional, the actual CSV quote character, default CSVQUOTE
    trim: Optional, the whitespace trim policy, default CSVTRIM

CSV generation

csvquote() - Format a CSV field from a clean value, quoting if necessary
  csv_field = csvquote(string [,comma [,quote]])
    string: The value to be formatted as CSV field
    comma: Optional, the desired CSV field delimiter, default CSVOSEP
    quote: Optional, the desired quote character, default CSVOQUOTE

csvcompose() - Compose a CSV record from an array of fields
  csv_record = csvcompose(afield [,comma [,quote]])
    afield: Array[1..n] of values to be formatted as CSV fields
    comma: Optional, the desired CSV field delimiter, default CSVOSEP
    quote: Optional, the desired quote character, default CSVOQUOTE

csvformat() - Build a CSV record from a record with fixed field delimiters
  csv_record = csvformat(record [,fs [,comma [,quote]]])
    record: The record to be formatted as CSV
    fs: Optional, the record field separator, default OFS
    comma: Optional, the desired CSV field delimiter, default CSVOSEP
    quote: Optional, the desired quote character, default CSVOQUOTE

CSVMODE aware printing

csvprint() - Print the current record, formatted as CSV.
             Works in any CSVMODE (1, 0, -1)
  csvprint()

Column headers

In CSV input modes the clean field values of the first record of a CSV input file are internally stored in an array of column headers. Any individual field can be accessed by its column name.

csvfield() - Retrieve the current value of the field with the given column name
  val = csvfield(name [, missing])
    name: The column name
    missing: Optional, returned if no such name, default null value

Convenience drivers

The csvmode distribution includes two convenience drivers, csvmode1.awk and csvmode-1.awk, to simplify one-line invocations. They include the csvmode library and set the CSVMODE variable to 1 and -1, respectively, in a BEGIN block.

# Convert CSV data to TAB-delimited
gawk -i csvmode-1 -v CSVOFS="\t" "1" ...CSV_data...