Using Groovy to Parse Application Logs for Performance Data
Jul 08 2009
Groovy came in handy for me once again last week when I was tasked with aggregating performance data from application logs. The logs were concatenated together into one file that totalled over 2 million lines. Interspersed throughout the logs were performance timer entries. Each entry included a timestamp, a string for tagging the line as a performance timer entry, the name of the object and method called, and the number of milliseconds elapsed for the method's execution.
Performance tests were run over a two-day period under various loads. For analysis purposes we wanted to see each method's average run time for each set of conditions, so we needed a way to look at aggregated data for specified time periods. The results would be output to a CSV format for easy integration with an Excel worksheet. Finally, we didn't want to spend too much time on creating a script since it wouldn't really be reuseable outside the context of this project, due to the specific formatting of the performance timing entries.
I chose Groovy for the following reasons:
- I'm comfortable with Java, and Groovy is a superset of Java.
- It's a scripting language, so I can have code that executes outside of an object model.
- Supports closures as well as some handy convenience methods.
- Allows me to use objects and classes as necessary.
I created a Groovy script that took four arguments:
- Start time
- End time
- Input file (the log)
- Output file
The script reads through every line in the log file. Working with file reader input streams is always a fairly straightforward exercise, but it's even quicker and easier with Groovy:
File file = new File(inFile) // inFile is a string file name
file.eachLine(parse) // parse is the name of a closure
Inside the closure, the it keyword is used to refer to the line (String) that was passed in:
nTimer = it.indexOf("PERF-TIMER]") + 12 // length of "PERF-TIMER"
The parse closure is where all the action is. The algorithm is specific to the unique performance timer entries for this project, checking each line for a "PERF-TIMER" string, then for the timestamp. It then determines if the timestamp falls between the times specified by the input arguments, and proceeds from there.
Outside the closure, I defined a class to store method data:
class MethodData {
long nCount = 0
long msTotal = 0 // total milliseconds
String methodName
void add(String ms) {
nCount++
int i = Integer.parseInt(ms.trim())
msTotal += i
}
double avg() {
return msTotal / nCount
}
String toString() {
return methodName + "," + nCount + "," + msTotal + "," + avg() + "\n"
}
}
Back inside the closure, the method name is extracted from the performance timer entry and used as a key for a Map object. If data has already been captured for a given method, the milliseconds data is just added to the total. Otherwise, a new object is created and put into the Map:
if (map.containsKey(strMethodName)) {
(MethodData)(map.get(strMethodName)).add(strMilliseconds)
} else {
data = new MethodData()
data.methodName = strMethodName
data.add(strMilliseconds)
map.put(strMethodName, data)
}
All this working with performance data got me thinking: how would the Groovy script itself perform?
Running the script from an Eclipse console on a MacBook Pro (4 GB memory), it went through a 260 MB log file with 2.003 million lines in just over 64 seconds. That's about 4 MB per second, or .03 milliseconds per record. Not bad, certainly acceptable for my purposes.