CSV (Comma-Separated Values) is a popular data-interchange format. This chapter presents NeoCSV, a library to parse and export CSV files. This chapter has originally been written by Sven Van Caekenberghe the author of NeoCSV and many other nicely designed Pharo libraries.
NeoCSV is an elegant and efficient standalone Pharo library to read (resp. write) CSV files converting to (resp. from) Pharo objects.
CSV is a lightweight text-based de facto standard for human-readable tabular data interchange. Essentially, the key characteristics are that CSV (or more generally, delimiter-separated text data):
References: http://en.wikipedia.org/wiki/Comma-separated_values, http://tools.ietf.org/html/rfc4180
Note that there is not one single official standard specification.
NeoCSV contains a reader (NeoCSVReader
) and a writer (NeoCSVWriter
) to parse and generate delimiter-separated text data to and
from Smalltalk objects. The goals of NeoCSV are:
To load NeoCSV, evaluate the following or use the Configuration Browser:
Gofer it
smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
configurationOf: 'NeoCSV';
loadStable.
To use either the reader or the writer, you instantiate them on a character stream and use standard stream access messages.
The first example reads a sequence of data separated by ,
and containing line breaks. The reader produces arrays corresponding to the lines with the data (the withCRs
method converts backslashes to new lines).
(NeoCSVReader on: '1,2,3\4,5,6\7,8,9' withCRs readStream) upToEnd.
--> #(#('1' '2' '3') #('4' '5' '6') #('7' '8' '9'))
The second proceeds from the inverse: given a set of data as arrays it produces comma separated lines.
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
nextPutAll: #( (x y z) (10 20 30) (40 50 60) (70 80 90) ) ].
-->
'"x","y","z"
"10","20","30"
"40","50","60"
"70","80","90"
'
NeoCSV can operate in generic mode without any further customization. While writing,
do:
protocol,asString
and quoted, andWhile reading,
The standard delimiter is a comma character. Quoting is always done using a double quote character. A double quote character inside a field will be escaped by repeating it. Field separators and line endings are allowed inside a quoted field. Any whitespace is significant.
Any character can be used as field separator, for example:
neoCSVWriter separator: Character tab
or
neoCSVWriter separator: $;
Likewise, any of the three common line end conventions can be set. In the following example we set carriage return:
neoCSVWriter lineEndConvention: #cr
There are 3 mechanisms that the writer may use to write a field (in increasing order of efficiency):
asString
and quoting it (the default);asString
but not quoting it; andprintOn:
directly on the output stream.When disabling quoting, you have to be sure your values do not contain embedded separators or line endings. If you are writing arrays of numbers for example, this would be the fastest way to do it:
neoCSVWriter
fieldWriter: #object;
nextPutAll: #( (100 200 300) (400 500 600) (700 800 900) )
The fieldWriter
option applies to all fields.
If your data is in the form of regular domain-level objects it would be wasteful to convert them to arrays just for writing them as CSV. NeoCSV has a non-intrusive option to map your domain object's fields: You add field specifications based on accessors. This is how you would write an array of Point
s.
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
nextPut: #('x field' 'y field');
addFields: #(x y);
nextPutAll: { 1@2. 3@4. 5@6 } ].
-->
'"x field","y field"
"1","2"
"3","4"
"5","6"
'
Note how nextPut:
is used to first write the header (i.e., the first line). After printing the header, the writer is customized: the messages addField:
and addFields:
arrange for the specified selectors (here x
and y
) to be sent on each incoming object to produce fields that will be written.
To change the writing behavior for a specific field, you have to use addQuotedField:
, addRawField:
and addObjectField:
.
To specify different field writers for an array (actually any subclass of SequenceableCollection
), you can use the first
, second
, third
, etc. methods as selectors:
String streamContents: [ :stream |
(NeoCSVWriter on: stream)
addFields: #(first third second);
nextPutAll: { 'acb' . 'dfe' . 'gih' }].
-->
'"a","b","c"
"d","e","f"
"g","h","i"
'
The parser is flexible and forgiving. Any line ending will do, quoted and non-quoted fields are allowed.
Any character can be used as field separator, for example:
neoCSVReader separator: Character tab
or
neoCSVReader separator: $;
NeoCSVReader will produce records that are instances of its recordClass
, which defaults to Array
. All fields are always read as Strings. If you want,
you can specify converters for each field, to convert them to integers, floats or any other object. Here is an example:
(NeoCSVReader on: '1,2.3,abc,2015/07/07' readStream)
separator: $,;
addIntegerField;
addFloatField;
addField;
addFieldConverter: [ :string | Date fromString: string ];
upToEnd.
--> an Array(an Array(1 2.3 'abc' 7 July 2015))
Here we specify 4 fields: an integer, a float, a string and a date field. Field conversions specified this way only work on indexable record classes, like
Array
.
While reading from a CSV file, you can ignore fields using addIgnoredField
. In the following example, the third field of each record is ignored:
| input |
(NeoCSVReader on: '1,2,a,3\1,2,b,3\1,2,c,3\' withCRs readStream)
addIntegerField;
addIntegerField;
addIgnoredField;
addIntegerField;
upToEnd
--> #(#(1 2 3) #(1 2 3) #(1 2 3))
Adding ignored field(s) requires adding field types on all other fields.
In many cases you will probably want your data to be returned as one of your domain objects. It would be wasteful to first create arrays and then convert all those. NeoCSV has non-intrusive options to create instances of your own classes and to convert and set fields on them directly. This is done by specifying accessors and converters. Here is an example for reading Association
s of Float
s.
(NeoCSVReader on: '1.5,2.2\4.5,6\7.8,9.1' withCRs readStream)
recordClass: Association;
addFloatField: #key: ;
addFloatField: #value: ;
upToEnd.
--> {1.5->2.2. 4.5->6. 7.8->9.1}
For each field you have to give the mutating accessor to use. You might also want to pass a conversion block using addField:converter:
.
Handling large CSV files is possible with NeoCVS. In the following, we first create a large CSV file then read it partly (this takes a bit of time, be patient).
'paul.csv' asFileReference writeStreamDo: [ :file |
ZnBufferedWriteStream on: file do: [ :out | | writer |
writer := (NeoCSVWriter on: out).
writer writeHeader: { #Number. #Color. #Integer. #Boolean}.
1 to: 1e7 do: [ :each |
writer nextPut: {
each.
#(Red Green Blue) atRandom.
1e6 atRandom.
#(true false) atRandom } ] ] ].
Above code results in a 300Mb file:
$ ls -lah paul.csv
-rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
$ wc paul.csv
10000001 10000001 342781577 paul.csv
The following code selectively collects every record with a third field lower than 1000 (this takes a bit of time, be patient):
Array streamContents: [ :out |
'paul.csv' asFileReference readStreamDo: [ :input |
(NeoCSVReader on: (ZnBufferedReadStream on: in))
skipHeader;
addIntegerField;
addSymbolField;
addIntegerField;
addFieldConverter: [ :x | x = #true ];
do: [ :each |
each third < 1000
ifTrue: [ out nextPut: each ] ] ] ].