Apache Groovy: Powerful text file processing
This article takes a deeper look at reading text files with Groovy and at its strengths compared to Java. We’ll revisit file types and explore the different ways Groovy reads lines from text files.
In previous articles, I went over some basics of Groovy. If you want to unlock the power of Groovy, this series will guide you through what makes it such a valuable language for developers. (If you haven’t installed Groovy yet, please read the intro to this series.)
A couple of years ago, I wrote an article about reading and writing files with Groovy. It’s a very simple demonstration of what Groovy can do differently — and arguably, better — than Java in the context of dealing with text files.
Now I’m going to dive a bit deeper into the topic of reading files in Groovy. But first, here’s some background on the Groovy (and Java) view of the contents of files.
On your Linux desktop, files generally fall into two categories: those containing human-readable text and those containing binary data. Binary data is difficult for humans to understand directly.
I wrote this article using LibreOffice Writer and it’s stored in ODF Text format on my computer. So a text format must be something you can read and understand, right? Well, no. If I use the more command to display the file, I can see it’s just one very long line that starts like this:
P��V^�2^L'mimetypeapplication/vnd.oasis.opendocument.textP��VConf
That’s not very useful. It turns out that the ODF Text format is binary data that can’t be rendered sensibly in a terminal. Hmm. But if you dig further, you’ll find that an ODF Text file is actually a compressed zip archive. You can use the unzip command on a .odt file to pull out all its components, many of which are text files.
When I unzip this article, I get:
$ unzip ../groo*19.odt
Archive: ../groovy-advent-19.odt
extracting: mimetype
creating: Configurations2/images/Bitmaps/
creating: Configurations2/accelerator/
creating: Configurations2/statusbar/
creating: Configurations2/menubar/
creating: Configurations2/popupmenu/
creating: Configurations2/floater/
creating: Configurations2/progressbar/
creating: Configurations2/toolbar/
creating: Configurations2/toolpanel/
inflating: manifest.rdf
inflating: meta.xml
inflating: settings.xml
extracting: Thumbnails/thumbnail.png
inflating: styles.xml
inflating: content.xml
inflating: META-INF/manifest.xml
I can look at the article text, which is kept in content.xml in XML format, with the more command. It is readable and does kinda make sense because I understand what XML is about:
$ more content.xml
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:css3t="https://www.w3.org/TR/css3-text/" xmlns:grddl=…
$
In short, you see that the .odt file is a binary file. You can use the unzip utility to help you make sense of this file by turning it into its mostly textual components. But at a more profound level, ALL files are binary. The files that are meant to be readable by humans are structured as text. The files that are meant to be readable by unzip or other utility programs are often structured as something else.

In Java, and therefore in Groovy, there are different pathways to process files depending on whether they’re intended to be interpreted as text or as some kind of structured binary.
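As a rough illustration of the two pathways, the sketch below writes a short string to a scratch file and then reads it back both ways, using the Groovy File enhancements getBytes() and getText(). The temporary file name and the sample string are made up for this example, and UTF-8 is assumed as the encoding.

```groovy
// Hypothetical scratch file, created only for this illustration.
def file = File.createTempFile('groovy19-bytes', '.txt')
file.deleteOnExit()

// Five letters plus a newline; \u00e9 (e with acute accent) takes two bytes in UTF-8.
file.setText('h\u00e9llo\n', 'UTF-8')

// Binary pathway: the raw bytes stored on disk.
byte[] raw = file.bytes
println "bytes on disk: ${raw.length}"      // 7

// Text pathway: the same bytes decoded into characters.
String text = file.getText('UTF-8')
println "characters:    ${text.length()}"   // 6
```

The two counts differ because one accented character occupies two bytes on disk, which is exactly the kind of detail the text pathway hides from you.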
Focusing on the text pathway, the first part of this structuring is to recognize that the bits in the file are meant to be read as instances of Unicode characters or Unicode code points:
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding (see the official definition of the Character class in the Java language documentation).
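The distinction between code points and code units can be seen with a small sketch. The sample character U+1F600 is an arbitrary choice for this example; any code point above U+FFFF would behave the same way.

```groovy
// U+1F600 lies outside the 16-bit range, so UTF-16 needs two code units (a surrogate pair) for it.
def smiley = new String(Character.toChars(0x1F600))
def s = 'a' + smiley

println "code units:  ${s.length()}"                                 // 3
println "code points: ${s.codePointCount(0, s.length())}"            // 2
println String.format('second code point: U+%X', s.codePointAt(1))   // U+1F600
```

So length() counts 16-bit char values, not the characters a human reader would count.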
Java, and therefore Groovy, defines a hierarchy of classes starting with java.io.Reader (see this documentation) that are used to read streams of characters – that is, text. Reader is an abstract class that is inherited, and further developed by:
BufferedReader, CharArrayReader, FilterReader, InputStreamReader, PipedReader, StringReader
Depending on where you want to acquire streams of characters, you can obtain specialized versions of these various readers. If you want to read from a text file, you should focus on the java.io.FileReader class (see this documentation), which is a subclass of InputStreamReader.
FileReader defines a convenient constructor, FileReader(String fileName), that lets you go directly from the name of a file to a reader ready to give you access to that stream of characters.
The problem with FileReader is it doesn’t define a handy readLine() method. To treat a stream of characters as having a line-oriented structure, you have to watch for line endings yourself. This is dangerous because the line-ending structure tends to be operating system-dependent. This requires you to buffer the character stream and then search for the line endings.
The usual remedy is to wrap the FileReader in a BufferedReader. This provides a readLine() method, as well as a lines() method that returns a java.util.stream.Stream<String> instance.
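As a small aside, here is a sketch of how BufferedReader copes with operating-system-dependent line endings. It reads from a StringReader instead of a file so that it is self-contained, and the sample text is invented for the example.

```groovy
import java.util.stream.Collectors

// Three line-ending conventions in one string: \n (Unix), \r\n (Windows), \r (classic Mac OS).
def text = 'unix\nwindows\r\nold-mac\rlast'

def reader = new BufferedReader(new StringReader(text))
// lines() yields a java.util.stream.Stream<String>; collect it into a List for inspection.
def lines = reader.lines().collect(Collectors.toList())
reader.close()

println lines   // [unix, windows, old-mac, last]
```

All three terminators are recognized and stripped, so the calling code never has to look for them.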
Starting in a close-to-Java way, here’s what it looks like:
1  if (args.length != 1) {
2      System.err.println "Usage: groovy Groovy19a.groovy input-file"
3      System.exit(1)
4  }
5  def reader = new BufferedReader(new FileReader(args[0]))
6  def line
7  while ((line = reader.readLine()) != null) {
8      println "line = $line"
9  }
10 reader.close()
Lines one to four deal with usage.
Line five defines the reader you need as a BufferedReader instance wrapping a FileReader instance that is attached to the filename provided as the first argument on the command line.
Line six defines the String line into which you read each line of the file.
Lines seven to nine loop over the lines of the file by calling the readLine() method on the reader and putting the value returned into the line variable until the end of the file is encountered, which readLine() signals by returning null. Each line read is printed out, prefixed by the string “line = ”.
Line 10 closes the reader.
Let’s run this, using the text / line-oriented file /etc/group as the input:
$ groovy Groovy19a.groovy /etc/group
line = root:x:0:
line = daemon:x:1:
line = bin:x:2:
line = sys:x:3:
…
$
This used to be my go-to approach to reading files back in my early Java days. When Java 1.7 came along, it brought with it the java.nio.file.Files class, which removed the need to wrap a FileReader in a BufferedReader by providing a newBufferedReader() factory method. I can’t say that I jumped onto this immediately, since it was simplifying by adding the complexity of a whole new class. But as I became more familiar with the Files class, I could see that it consolidated a whole bunch of related utilities into one place, which made it worth learning.
For instance, since Java 8, Files provides the lines() method, which takes the java.nio.file.Path of the file as an argument and returns a stream of its lines. Getting rid of the reader variable, the BufferedReader, and the FileReader drastically streamlines Java code. The while loop, its end-of-file test, and the line variable are also eliminated, and the explicit call to close() disappears as well. Note, though, that the Files.lines() documentation recommends closing the returned stream (for example, with try-with-resources) so that the underlying file is released promptly. However, in the spirit of giving while taking away, it also requires learning the whole Streams thing… a good practice anyway. The Path.of() factory used here requires Java 11 or later; on older versions, Paths.get() does the same job:
import java.nio.file.Files
import java.nio.file.Path

// Require exactly one argument: the file to read
if (args.length != 1) {
    System.err.println "Usage: groovy Groovy19b.groovy input-file"
    System.exit(1)
}
// Stream the lines of the file and print each one
Files.lines(Path.of(args[0])).each { line -> println "line = $line" }
Turns out it streamlines the Groovy code as well.
Running it:
$ groovy Groovy19b.groovy /etc/group
line = root:x:0:
line = daemon:x:1:
line = bin:x:2:
line = sys:x:3:
…
$
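The Files.lines() documentation asks callers to close the returned stream so that the open file is released. One possible way to do that in Groovy is the withCloseable() method, sketched below. This assumes Groovy 2.5 or later, where withCloseable() is available on any AutoCloseable; the scratch file and its contents are made up so the example runs on its own.

```groovy
import java.nio.file.Files

// Hypothetical scratch file standing in for the command-line argument.
def path = Files.createTempFile('groovy19-lines', '.txt')
path.toFile().setText('root:x:0:\ndaemon:x:1:\n', 'UTF-8')

def seen = []
// withCloseable() calls close() on the stream when the closure returns, even if it throws.
Files.lines(path).withCloseable { stream ->
    stream.each { line ->
        seen << line
        println "line = $line"
    }
}
Files.delete(path)
```

This costs one extra level of nesting, which may or may not be worth it for a short-lived script.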
Another way to solve this problem is by applying the Groovy enhancements to the File class:
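// Require exactly one argument: the file to read
if (args.length != 1) {
    System.err.println "Usage: groovy Groovy19c.groovy input-file"
    System.exit(1)
}
// eachLine() opens the file, passes each line to the closure, and closes the file
new File(args[0]).eachLine { line -> println "line = $line" }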
This one is probably my favorite since it doesn’t involve learning several new class hierarchies, plus it’s nice and compact.
When you run it, you see:
$ groovy Groovy19c.groovy /etc/group
line = root:x:0:
line = daemon:x:1:
line = bin:x:2:
line = sys:x:3:
…
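The eachLine() method has a couple of variants that may be handy. The sketch below names the charset explicitly and uses a two-parameter closure, which also receives the line number. The scratch file and its contents are invented so the example does not depend on /etc/group.

```groovy
// Hypothetical scratch file, used so the sketch does not depend on /etc/group.
def file = File.createTempFile('groovy19-eachline', '.txt')
file.deleteOnExit()
file.setText('root:x:0:\ndaemon:x:1:\nbin:x:2:\n', 'UTF-8')

def numbered = []
// The leading 'UTF-8' argument names the charset used to decode the file.
// A two-parameter closure also receives the line number, which starts at 1.
file.eachLine('UTF-8') { line, number ->
    numbered << "${number}: ${line}".toString()
    println "$number: $line"
}
```

With the sample contents, this prints the three lines prefixed with 1, 2, and 3.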
The Groovy File class also has a withReader() method that calls a closure, passing it a BufferedReader instance. I don’t find this as convenient as using the eachLine() method described above. But when I’m reading from a file and writing to a different file, I definitely use the withWriter() method to have the writer available in my BufferedReader processing closure.
But I’ll save that for next time.
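For completeness, here is a minimal sketch of the withReader() method mentioned above, in a case where it may be a better fit than eachLine(): reading only the first line and then stopping. The scratch file and its contents are made up for the example.

```groovy
// Hypothetical scratch file, created only for this illustration.
def file = File.createTempFile('groovy19-reader', '.txt')
file.deleteOnExit()
file.setText('root:x:0:\ndaemon:x:1:\n', 'UTF-8')

// withReader() hands the closure a BufferedReader and closes it afterwards.
// The value of the closure becomes the value of the withReader() call.
def firstLine = file.withReader('UTF-8') { reader ->
    reader.readLine()   // read only the first line, then stop
}
println "first line = $firstLine"   // first line = root:x:0:
```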
I’m not going to cover reading binary files here because you need to know what to do with the binary information in order to structure and interpret it. In any case, I generally prefer to work with text files whenever possible. It’s much easier to decouple processing steps and view intermediate results when text files are used to communicate between steps.
Conclusion
While Java’s early take on Readers seems a bit clunky, with the need to wrap a FileReader in a BufferedReader, use a loop to iterate over the lines in the file, and remember to close the whole thing at the end, things have improved. Now reading from a file can be a single-line program, without that line being 1,000 characters long.
Once again, you see that the Groovy approach is to make the classes you already know — like File — more useful by adding a new behavior to them. Compare this to the modern Java approach, which is to add new class hierarchies that add a new behavior while consolidating old behavior, posing a steeper learning curve.
