Author: Manuel Collado,
<m-collado@users.sourceforge.net>,
http://mcollado.z15.es
CSVMODE Version 0.2a, January 2021
Public
domain
Contents
The Comma-Separated-Values (CSV) data format is commonly used by spreadsheets and database engines to import and export data as plain text files.
A CSV file is a sequence of records separated by newline marks. A CSV record is a sequence of fields separated by commas. A field can contain almost any text. If a field contains commas, newlines or double quotes it must be enclosed in double quotes. Double quotes inside a field must be escaped by doubling them. Example:
author,title,remarks Shakespeare,A Midsummer Night's Dream,comedy "Stevenson, Robert Louis",Treasure Island,novel anonymous,"A ""quoted"" word","remark 1 remark 2"
There are four records, each one with three fields. The field
"Stevenson, Robert Louis"
is quoted because it contains a
comma. The field "A ""quoted"" word"
is quoted because it
contains escaped quotes (coded as duplicates). The third field of the last
record has two lines of text. The data is equivalent to the following
table:
author | title | remarks |
---|---|---|
Shakespeare | A Midsummer Night's Dream | comedy |
Stevenson, Robert Louis | Treasure Island | novel |
anonymous | A "quoted" word | remark 1 remark 2 |
Download and unpack the csvmode-x.x.zip
(or
csvmode-x.x.tgz
) distribution archive in a
directory of your choice. And either add the csvmode
subdirectory to your AWKPATH environment variable, or copy the
csvmode*.awk
files to a directory already in AWKPATH.
You can see what is your current effective AWKPATH by issuing the command:
gawk 'BEGIN {print ENVIRON["AWKPATH"]}'
On Windows you should type:
gawk "BEGIN {print ENVIRON[\"AWKPATH\"]}"
The CSVMODE library requires GNU awk (gawk
) version 4.1.0
or later. You can see what is your current gawk
version by
issuing the command:
gawk --version
The CSVMODE library provides facilities for:
The CSVMODE library must be explicitly included with either a -i csvmode option in the command line or a @include "csvmode" directive in the awk script code.
A couple of functions can convert between CSV fragments and clean text fragments.
csvquote: <b,"c> --> <"b,""c"> csvunquote: <"b,""c"> --> <b,"c>
The csvsplit()
function divides a CSV
formatted record string into pieces. The CSV fields are stored as elements
of an array. Example:
data --> a,"b,c",d n = csvsplit(data, af)
gives
n = 3 af[1] -> <a> af[2] -> <"b,c"> af[3] -> <d>
It is possible to handle CSV data that use alternate delimiter or quote characters. For instance, if the record uses semicolons instead of commas to delimit fields, and single quotes instead of double quotes:
data --> a;'b;c';d n = csvsplit(data, af, ";", "'")
gives
n = 3 af[1] -> <a> af[2] -> <"b;c"> af[3] -> <d>
Automatic parsing of CSV data files is controlled by a predefined CSVMODE control variable. If set to a non-zero, non-null value the input data file reader automatically recognizes CSV records and split them into fields as expected. The fields are delivered as $1, $2, ... $NF as usual. The user can choose to get CSV fragments (with CSVMODE=1) or clean text values (with CSVMODE=-1). If CSVMODE is undefined or has a null value, input files are processed in the normal way.
Sample data file:
a,b,c p,"q,r",s x,"""y""",z
Awk script for CSV fragments:
@include "csvmode" BEGIN { CSVMODE = 1 } { print $2 }
Result:
b "q,r" """y"""
Awk script for clean text fields:
@include "csvmode" BEGIN { CSVMODE = -1 } { print $2 }
Result:
b q,r "y"
The parsing process can be customized in order to accept non-standard CSV data files. A couple of predefined variables can be used to specify the special field delimiter and quoting characters:
CSVSEP: The special character that delimit the fields. By default a comma (,).
CSVQUOTE: The specific character used to quote values. By default a double quote (").
Sample data file:
a;b;c p;q,r;s x;'"y"';z
Awk script:
@include "csvmode" BEGIN { CSVMODE = 1; CSVSEP = ";"; CSVQUOTE = "'" } { print $2 }
Result:
b q,r '"y"'
The whole CSV record is stored as $0. The format depends on CSVMODE:
In the latter case the original CSV record is not lost, but stored in a predefined CSVRECORD variable. And the user can change the field delimiter by assigning a custom value to a CSVOFS predefined variable.
Sample data file:
a,b,c p,"q,r",s x,"""y""",z
Awk script:
@include "csvmode" BEGIN { CSVMODE = -1; CSVOFS = "|" } { print }
Result:
a|b|c p|q,r|s x|"y"|z
File processing in the automatic CSVMODE correctly recognizes CSV records with multiline fields. I.e., fields that contain newline characters.
Sample data file:
a,b,c p,"q r",s x,"""y""",z
Awk script:
@include "csvmode" BEGIN { CSVMODE = -1 } { print "<" $2 ">" }
Result:
<b> <q r> <"y">
A CSV record can be created two ways:
Examples:
f[1] = "007" f[2] = "Bond, James" f[3] = "United Kingdom" csvcompose(f) -> <007,"Bond, James",United Kingdom> record = "007/Bond, James/United Kingdom" csvformat(record, "/") -> <007,"Bond, James",United Kingdom> csvformat(record, "/", ";") -> <007;Bond, James;United Kingdom>
A convenient csvprint()
function prints the current record
as CSV data in any CSVMODE (1, 0, -1).
Sample data file:
a b,c "hello"
Awk script:
@include "csvmode" { csvprint() }
Result:
a,"b,c","""hello"""
Gawk offers three different input modes, controlled by the FS, FPATH
and FIELDWIDTHS variables. The csvmode
library offers two
additional modes for CSV input files, controlled by CSVMODE=1 and
CSVMODE=-1, respectively. It also provides a set of string handling
functions to manually handle CSV formatted text.
The appropriate input mode depends on how the data will be processed. The following advises can be of help.
CSV data can be handled with just the functions provided by the
csvmode
library. In this case the csv*()
functions can be seen as a first-aid facilities. This mode is suitable
when most part of the code doesn't depend on the fact that some data are
formated as CSV. Or when the CSV notation is used only for output.
This mode is suitable when the clean text field values are only occasionally needed. For instance, for selecting or reordering some columns. Or assigning new field values that doesn't depend on the value of other fields. Or just extracting some records with patterns based on the CVS notation.
This mode is not suitable if the input and output CSV notations use different delimiter or quote characters.
This mode is possibly the most powerful one. All computations may be done internally on clean text values. And CSV data is automatically parsed on input and generated on output, if necessary. In addition, this mode allows mixed processing of clean text and CSV files with the same set of pattern-action rules.
The only drawback is probably that some quoting on input can be lost when some field values are reproduced on output. Output CSV fields are quoted only if necessary.
The usecases/
directory of the csvmode
distribution contains a set of simple use cases that can help new users to
start using the tool.
The CSVMODE control variable must be set to a non-null value in order to enable CSV input parsing.
CSVMODE = 0 Disable CSV input parsing (default) = 1 Enable CSV input parsing = -1 Convert CSV input records into regular awk records CSVRECORD: For CSVMODE=-1, stores the original CSV input record
The value of CSVMODE is checked at BEGINFILE time. If non null, the
csvmode
library takes control of FS, OFS and FPAT variables
during the input file processing. These variables are automatically
restored at ENDFILE time. The effect of changing these variables in the
middle of the file processing in unpredictable. Changing CSVMODE in the
middle of the file has no effect.
If the user wants to set CSVMODE in a BEGINFILE clause, then the
relevant BEGINFILE block must be coded before the @include
"csvmode"
directive.
BEGINFILE { CSVMODE = FILENAME ~ /\.csv$/ } @include "csvmode"
In this mode a suitable FPAT value is computed and applied. $0 contains the whole CSV input record and $1..$NF contain the CSV field fragments, as given by the FPAT parsing feature.
FPAT is computed according to CSVSEP and CSVQUOTE. Also OFS is temporarily set to CSVOSEP, so assigning new values to individual fields or to the whole record should work as expected.
In this mode the input CSV record is demangled and converted to a regular awk record. $1..$NF contain the clean values of the fields, and $0 contains the concatenation of these fields, delimited by CSVOFS. Also FS and OFS are temporarily set to CSVOFS, so assigning new values to individual fields or to the whole record should work as expected.
CSVOFS must contain a character not used in the CSV input file. The default SUBSEP character should work in almost all cases.
CSV data management can be customized by a variety of optional variables. If any variable is not explicitly set, the corresponding default value is used.
CSVSEP Input field delimiter, default comma (,) CSVQUOTE Input quoting character, default double quote (") CSVTRIM = 0 Preserve all space in fields (default) = 1 Discard unquoted leading or trailing space = 2 And also collapse contiguous space into a single space CSVONELINE Flag, disable multiline fields (default enabled) CSVOSEP Output field delimiter, default CSVSEP CSVOQUOTE Output quoting character, default CSVQUOTE CSVOFS Field separator for CSVMODE=-1, default SUBSEP
csvunquote() - Extract the clean value of a CSV field value = csvunquote(csvfield [, quote [, trim]]) csvfield: The CSV formatted field quote: Optional, the actual CSV quote character, default CSVQUOTE trim: Optional, the whitespace trim policy, default CSVTRIM
csvsplit() - Split a CSV record into an array of CSV fields num_fields = csvsplit(csvrecord, afield [, comma [, quote]]) csvrecord: The CSV formatted record afield: Array[1..n] of CSV formatted fields comma: Optional, the actual CSV field delimiter, default CSVSEP quote: Optional, the actual CSV quote character, default CSVQUOTE
csvparse() - Parse a CSV record and build an array of clean value fields num_fields = csvparse(csvrecord, afield [, comma [, quote [, trim]]]) csvrecord: The CSV formatted record afield: Array[1..n] of clean value fields comma: Optional, the actual CSV field delimiter, default CSVSEP quote: Optional, the actual CSV quote character, default CSVQUOTE trim: Optional, the whitespace trim policy, default CSVTRIM
csvquote() - Format a CSV field from a clean value, quoting if necessary csv_field = csvquote(string [,comma [,quote]]) string: The value to be formatted as CSV field comma: Optional, the desired CSV field delimiter, default CSVOSEP quote: Optional, the desired quote character, default CSVOQUOTE
csvcompose() - Compose a CSV record from an array of fields csv_record = csvcompose(afield [,comma [,quote]]) afield: Array[1..n] of values to be formatted as CSV fields comma: Optional, the desired CSV field delimiter, default CSVOSEP quote: Optional, the desired quote character, default CSVOQUOTE
csvformat() - Build a CSV record from a record with fixed field delimiters csv_record = csvformat(record [,fs [,comma [,quote]]]) record: The record to be formatted as CSV fs: Optional, the record field separator, default OFS comma: Optional, the desired CSV field delimiter, default CSVOSEP quote: Optional, the desired quote character, default CSVOQUOTE
csvprint() - Print the current record, formatted as CSV. Works in any CSVMODE (1, 0, -1) csvprint()
In CSV input modes the clean field values of the first record of a CSV input file are internally stored in an array of column headers. Any individual field can be accessed by its column name.
csvfield() - Retrieve the current value of the field with the given column name val = csvfield(name [, missing]) name: The column name missing: Optional, returned if no such name, default null value
The csvmode
distribution includes two convenience drivers,
csvmode1.awk
and csvmode-1.awk
, to simplify
one-line invocations. They include the csvmode
library and
set the CSVMODE
variable to 1 and -1, respectively, in a
BEGIN block.
# Convert CSV data to TAB-delimited gawk -i csvmode-1 -v CSVOFS="\t" "1" ...CSV_data...