Application logging using global inception identifier

March 7, 2015

For multithreaded programs that write to log files, a best practice is to include a tracking ID in each output. What should this ID be, and how should it be used? The following is presented as a ‘design pattern’.

TL;DR

For machine readability and tool use, a non-business related global inception ID (GIID) should be used for every log output. This ID is a combination of an ‘origin id’ and ‘unique id’. When each new business level or operational id is obtained or generated, it is logged to provide an ‘association’ with this GIID. This is similar to various existing practices so it is presented as a Software Design Pattern.

Author: Josef Betancourt, Jan 11, 2015. Originally posted here.

CR Categories: H.3.4 [Systems and Software]; D1.3 [Concurrent Programming]; D.2.9 [Management]; K.6 [MANAGEMENT OF COMPUTING AND INFORMATION SYSTEMS]; K.6.3 [Software Management]

  1. Context
  2. Forces
  3. Solution
  4. Consequence
  5. Implementation
  6. Appendix
  7. Further reading

Context

App Logs

An application log file records runtime programmatic events: details, exceptions, debug, and performance related measures. This is different than specialized server log files, such as a webserver’s access logs, error.log, etc. The latter are more standardized, for example with W3C Extended Log File Format, and well supported.

App logs usually capture details at specific logging levels and named contexts. In the Java ecosystem there are plenty of libraries to support this and now the JDK supports this as part of the java.util.logging package.

Despite the advances made in logging APIs and support within operating systems and frameworks, app logs are at a primitive level of software maturity. What A. Chuvakin and G. Peterson describe as the “… horrific world of application logging” is composed of ad hoc, minimally documented, unstructured, untested, and under-specified delivered components.

Attempts to create widely used industry standards have failed, and every business, software project, dev team, or industry reinvents solutions and attempts to tackle the same problems.
 

Forces

In the context of server app logs, multiple sessions will output log information that can be intermixed. These sessions can be part of an ongoing process, such as a user interaction with a web site.

External integration points (web services, database, etc) may also be invoked. Unless each session is identified in the log and integration logs, subsequent support, debug, and auditing are very difficult.

The problem is not just tracking where and when ‘integration’ occurred or its non-functional integration criteria (i.e., timing), but the tracking of subsequent logging, if any, at that location.

App logs are used extensively during development. Their importance is illustrated by an old meme: “debuggers are for wimps”. As such, logs with impeccable tracking used for design and test are a good Return On Investment (ROI).

The same is true for deployed systems. In many cases the only information available on historical behavior is in a log file.

This seems like a programming 101 lesson, but it is widely disregarded in practice. That log file output is treated as a minor concern, and is sometimes not even part of a “code review”, is puzzling.

Scenarios

1. A service must invoke a distributed call to another system. The service has retry logic and logs each failure. If each log output does not identify the session or operation, the retries could get mixed with other threads’. Identifying an end user’s request becomes a hit-or-miss bracketing of time stamps if the development team did not include enough identifiable data in each log output.

2. A computer-savvy end user or family may attempt to register with your system using multiple browsers simultaneously. This could cause problems if multiple logins are supported and an error occurs. How do you track this and debug it?

3. The app server makes a remote call to a service integration point and that service fails. How is the owner of that service informed as to the specific invocation? There are probably deployed systems where one would have to compare time stamps on log output to even coordinate where the two systems communicated, and even then it is vague. Some systems may not even do any logging unless there is a fault of some kind.

4. You have to identify time periods based on hazy user complaints, search through multiple log files with regular expressions, then walk each output to recreate a specific error scenario. Isn’t this manual drudgery what computers were supposed to eliminate?
 

Solution

Global Inception ID

Logging with Unique identifiers is encouraged as a best practice:

“Unique identifiers such as transaction IDs and user IDs are tremendously helpful when debugging, and even more helpful when you are gathering analytics. Unique IDs can point you to the exact transaction. Without them, you might only have a time range to use. When possible, carry these IDs through multiple touch points and avoid changing the format of these IDs between modules. That way, you can track transactions through the system and follow them across machines, networks, and services.” — http://dev.splunk.com/view/logging-best-practices/SP-CAAADP6
 

This unique identifier is generalized so that on first entry into a system, or at the start of a process, a Global Inception ID (GIID) is assigned to distinguish that ongoing process from others. A more descriptive term would be Global Tracking ID, but that conjures up privacy and security concerns and is already used in commerce for a different purpose. But ‘inception ID’ brings up visions of barcodes on people’s foreheads. Ok, how about “bludzwknxxkjysjkajskjjj”?

The term “Global” is to indicate that this ID is unique in a specific enterprise system. The uniqueness comes from its creation on a first contact basis on a specific subsystem. In essence this is a log tracking ID.

For example, a web server or an app server would be the first point of contact for a request from an external user. The GIID, consisting of a combination of origin ID and unique ID, would be created at this point:

GIID ::= originID uniqueID
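As a sketch, that composition could look like the following in Java. The origin code “PE65K” and the method name are illustrative, not part of any standard:

```java
import java.util.UUID;

public class Giid {
    // GIID ::= originID uniqueID; the unique part is a UUID with hyphens stripped.
    static String newGiid(String originId) {
        String uniqueId = UUID.randomUUID().toString().replace("-", "");
        return originId + uniqueId;
    }

    public static void main(String[] args) {
        String giid = newGiid("PE65K");    // "PE65K" is a hypothetical origin code
        System.out.println(giid);          // e.g. PE65K followed by 32 hex chars
        System.out.println(giid.length()); // 5 + 32 = 37
    }
}
```

Every log output for the lifetime of the request would then carry this single string.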

In article “Write Logs for Machines, use JSON” Paul Querna uses the term “txnId” for this type of ID:

“… this is essentially a unique identifier that follows a request across all of our services. When a User hits our API we generate a new txnId and attach it to our request object. Any requests to a backend service also include the txnId. This means you can clearly see how a web request is tied to multiple backend service requests, or what frontend request caused a specific Cassandra query.”
 

Another term for this GIID, or ‘txnId’ is Correlation ID. This terminology is used in SharePoint.

 
The correlation ID is a GUID (globally unique identifier) that is automatically generated for every request that the SharePoint web server receives.

Basically, it is used to identify your particular request, and it persists throughout further communications, if any, between the other servers in the farm. Technically, this correlation ID is visible at every level in the farm, even at a SQL profiler level and possibly on a separate farm from which your SharePoint site consumes federated services. So for example, if your request needs to fetch some information from an application server (say, if you are using the web client to edit an Excel spreadsheet), then all the other operations that occur will be linked to your original request via this unique correlation ID, so you can trace it to see where the failure or error occurred, and get something more specific than “unknown error”. — https://support.office.com/en-nz/article/SharePoint-2010-Correlation-ID-in-error-messages-what-it-is-and-how-to-use-it-5bf2dba7-43d2-484c-8ef4-e059f76e3efa

Various ‘Structured Logging’ efforts and syslog implementations already contain a ‘sending’ field specification. The GIID incorporates the sender ID as the Origin ID, and this combination is more amenable to parsing by humans and text tools.

 

Consequence

Size

A good candidate for a GIID must be large enough to satisfy uniqueness requirements. This could be, for example, a 36-character field. Where log files are manually inspected with a text editor, this lengthens each log line, which already contains many common fields such as a time stamp.

Security

Unintentionally, “bad” logging practices make it harder to track and correlate personally identifiable information (PII). With the use of the trans-system GIID, correlation between various business-related identifiers is made easier.

The correlation ID is not necessarily a secret, but like other tracking objects such as cookies, it can be used for information discovery or questionable information storage. But if an attacker can already access your log files, there are other, more serious issues.

Redundancy

What determines the first contact subsystem? A true distributed system could be configured, statically or dynamically, so that any subsystem could be the first point of contact. If so, each subsystem would be creating a GIID and passing it to other systems that are themselves creating GIIDs.

One approach to handle this is that a system will only create a GIID if none is present in the incoming request.
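A minimal sketch of that rule, assuming a hypothetical “X-GIID” header name for propagating the ID between systems:

```java
import java.util.Map;
import java.util.UUID;

public class GiidFilter {
    // Reuse an upstream GIID when present; otherwise this subsystem
    // becomes the origin and mints a new one. "X-GIID" is an assumed
    // header name, not a standard.
    static String resolveGiid(Map<String, String> headers, String originId) {
        String incoming = headers.get("X-GIID");
        if (incoming != null && !incoming.isEmpty()) {
            return incoming;
        }
        return originId + UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        // Upstream already set an ID: it is passed through unchanged.
        System.out.println(resolveGiid(Map.of("X-GIID", "PE65Kabc123"), "PE65K"));
        // No incoming ID: a fresh GIID is created with our origin code.
        System.out.println(resolveGiid(Map.of(), "PE65K").startsWith("PE65K"));
    }
}
```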

Feedback

For user interfaces, exposing the GIID or parts of it in exception situations can be beneficial:

“We also send the txnId to our users in our 500 error messages and the X-Response-Id header, so if a user reports an issue, we can quickly see all of the related log entries.” — https://journal.paul.querna.org/articles/2011/12/26/log-for-machines-in-json/
 

Compare this to the Hunt the Wumpus adventure in enterprises that only have an approximate time of an issue and must track this over multiple systems.

Accuracy

If a GIID is part of a support system and, as above, the ID is shared with users in some way, would the value need some form of validity testing? Should it be tested that it is well-formed and includes a checksum?

Example CRC calculation for a UUID, based on the textual version of the ID:

groovy -e "java.util.zip.Adler32 crc = new java.util.zip.Adler32(); crc.update(UUID.randomUUID().toString().getBytes());println Long.toHexString(crc.getValue())"
af9c09a3
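The same checksum in plain Java, in case the ID needs validating on a system without Groovy; the class and method names are illustrative:

```java
import java.util.UUID;
import java.util.zip.Adler32;

public class GiidChecksum {
    // Adler-32 over the textual ID; the hex result could be appended to a
    // GIID shared with end users so support tools can reject mistyped IDs.
    static String checksum(String id) {
        Adler32 crc = new Adler32();
        crc.update(id.getBytes());
        return Long.toHexString(crc.getValue());
    }

    public static void main(String[] args) {
        String id = UUID.randomUUID().toString();
        System.out.println(id + " -> " + checksum(id));
    }
}
```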

 

Implementation

Origin ID

An OID uniquely identifies a system in an enterprise. This could be a web server or messaging system. Using a code for the actual system is recommended. Thus, instead of Acctsys, it would be a code, PE65K for example. Using a code is more ‘durable’ than a human readable name.

An extension is to also encode other information in the origin ID, such as application or subsystem identifiers.

Unique ID

This ID must not be a business related entity such as user id or account number. The simple reason is that these may occur in the logging record multiple times for different sessions or transactions. For example, user Jean Valjean with account number 24601 may log in multiple times into a web site. Tracking a specific session if a problem occurs is easier if we use a unique ID.

A business level ID may not even be relevant in another system that interacts with the origin point. In one system the ID could be one type of ID, and in the other the same party or user could be identified with a different ID.

Note that as soon as determined, accessed, or generated, a business level ID should be associated with the GIID. This could be a simple log output of that business ID which, since every log output has a GIID, will associate the business ID or value with the GIID.

Similarly when the same process communicates with another system, that system’s unique identifiers and related business IDs will also be associated with the GIID. For example, a web service will take the GIID and relate it to its own ID(s). Now a support engineer can follow the complete path of the system response to a user session.
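A toy sketch of that association step. The field names (_ID, ACCT) follow the example log format used in this article and are not a standard:

```java
public class Association {
    // One log line like this ties a business ID to the GIID that every
    // line already carries; searching on either value finds the session.
    static String associate(String giid, String key, String value) {
        return "EV={_ID:\"" + giid + "\", " + key + ":\"" + value + "\"}";
    }

    public static void main(String[] args) {
        String giid = "PE65K1f788da1ac4a43bb82adb8e61cfcb205";
        System.out.println(associate(giid, "ACCT", "24601"));
    }
}
```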

ID creation

The easiest approach is to use the entry system’s session ID created by the server. A potential problem is that this session ID is not guaranteed to be unique, and it ends when the session expires. A UUID solves most of these problems.

Sample UUID generation in Groovy language:

groovy -e "println UUID.randomUUID().toString().replace('-','')"
1f788da1ac4a43bb82adb8e61cfcb205 

If the system ID is 3491, then the above UUID is used to create the GIID for use in logging:

20110227T23:34:37,901; EV={_ID:"34911f788da1ac4a43bb82adb8e61cfcb205", USERID:"felixthecat", ….}

Alternative to UUID use?

A UUID is a 32-character string (36 with hyphens). Could something smaller be used? Perhaps, but eventually the complexity of threaded systems would push the uniqueness constraint of any ID toward a comparable length.

Other approaches are possible. Most of these will incorporate a timestamp in some way. Note that a version 1 UUID contains a timestamp.

An example of a ‘unique’ ID is MongoDB’s ObjectID specification. That spec calls for a 12-byte BSON type consisting of:
• a 4-byte value representing the seconds since the Unix epoch,
• a 3-byte machine identifier,
• a 2-byte process id, and
• a 3-byte counter, starting with a random value.
An example of an ObjectID string representation is ObjectId("507f1f77bcf86cd799439011").
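For illustration only, a rough Java sketch following that 12-byte layout; this is not MongoDB’s implementation, and the machine and process IDs are placeholder values:

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class CompactId {
    // 3-byte counter starts at a random value, per the ObjectID spec.
    static final AtomicInteger COUNTER =
            new AtomicInteger(new SecureRandom().nextInt(0x1000000));

    static String next(int machineId, int processId) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putInt((int) (System.currentTimeMillis() / 1000));  // 4-byte epoch seconds
        buf.put((byte) (machineId >> 16)).put((byte) (machineId >> 8)).put((byte) machineId);
        buf.putShort((short) processId);                        // 2-byte process id
        int c = COUNTER.getAndIncrement() & 0xFFFFFF;           // 3-byte counter
        buf.put((byte) (c >> 16)).put((byte) (c >> 8)).put((byte) c);
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.array()) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(next(0xABCDEF, 1234));  // 24 hex characters
    }
}
```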

Log Framework support for GIID

The use of a GIID is a ‘cross-cutting’ concern. Requiring programmatic use of this ID would be burdensome and error-prone, even if stored in a thread-safe context.

Some logging frameworks support the concept of “nested diagnostic contexts”. This is a way of storing an ID so that interleaved logging is properly identified. See http://wiki.apache.org/logging-log4j/NDCvsMDC for more information.
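To show the mechanism without depending on a particular framework, here is a stripped-down, thread-local sketch of what an MDC-style context does; real applications should use Log4j’s MDC/NDC or SLF4J’s MDC instead:

```java
import java.util.HashMap;
import java.util.Map;

public class MiniMdc {
    // Per-thread map that a logging layout can read, so every log line
    // carries the GIID without passing it through method signatures.
    private static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CTX.get().put(key, value); }
    static String get(String key) { return CTX.get().get(key); }

    static void log(String msg) {
        // The "layout" prepends the GIID from the thread context.
        System.out.println("_ID=" + get("giid") + " " + msg);
    }

    public static void main(String[] args) {
        put("giid", "PE65K1f788da1");  // set once at request entry
        log("user login");             // every log call picks it up
        log("account lookup");
    }
}
```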

Example usage

In a conventional Java server application a JSP or template system would obtain a GIID and insert it into generated pages and associated client side scripts. That GIID would also be stored in the server side session. Since the GIID is stored at the session it is accessible to the various services and components on a server.

This ID is embedded in requests to other distributed servers and provides event correlation. Thus the logging systems will have access to the GIID, and client- or server-side errors can then display or use the GIID for tracking and reporting to assist support engineers.

Since the client also has the GIID, it can display or use this ID for customer service processing.

Of course, this would make more sense if it is a part of a wider Application Service Management (ASM) system.

Standards for IDs

Though many standards specify ID handling, modern architectures, especially web-based or distributed ones, emphasize stateless protocols. A GIID requirement could be seen as one of those legerdemain stateful practices.

Development impacts

If logging is a deployed feature of an application, then it too needs testing. But since log output is an integration point, it does not fall under the “unit” testing umbrella. There is even some doubt whether it should be tested at all! Here is one example: Unit Testing: Logging and Dependency Injection

If log files can contain security flaws, convey data, impact support, and impair performance, then they should be tested for conformance to standards. Project management spreadsheets need to add rows for logging concerns.

Technology

Log output can be developer tested using the appropriate xUnit framework, such as JUnit. Mocking frameworks provide a means of avoiding actually sending the output of a logger to an external ‘appender’; see “Use JMockit to unit test logging output”.

Issues

During development, the log output changes rapidly as the code changes. Selecting where in the software development life cycle (SDLC) to test logging, or even to specify what logs should contain, is difficult.

One approach is that the deployed system does no app logging that was not approved by the stakeholders. These approved outputs are “unit” tested, and all development support logging is removed or disabled except for use in a development environment.
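One way to developer-test approved log output without a mocking framework is an in-memory handler. This sketch uses java.util.logging; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogCapture {
    // Attach an in-memory handler so a test can assert on log records
    // directly instead of parsing a file or console output.
    static List<String> capture(Logger logger, Runnable action) {
        List<String> messages = new ArrayList<>();
        Handler h = new Handler() {
            @Override public void publish(LogRecord r) { messages.add(r.getMessage()); }
            @Override public void flush() {}
            @Override public void close() {}
        };
        logger.setUseParentHandlers(false);  // keep output off the console
        logger.addHandler(h);
        try { action.run(); } finally { logger.removeHandler(h); }
        return messages;
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("app");
        List<String> msgs = capture(log, () -> log.info("_ID=PE65K1f788da1 user login"));
        // A "unit" test would assert the approved format, e.g. GIID present:
        System.out.println(msgs.size() == 1 && msgs.get(0).contains("_ID="));
    }
}
```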

Deployment

There is no need to change every subsystem to use this log tracking approach. If the GIID is created somewhere in the “path” of a process or service, it adds value. Other systems can gradually start to use a tracking ID. Thus, the tools and training to use this capability can also be progressively introduced.

About this post

I was going to title this article ‘Logging in Anger’, as a response to my own experiences with application logging. Alas, there are so many issues that I had time to focus on only one, as a result of a recent stint supporting an application that exhibits the typical logging anti-patterns. Example: it’s bad to get a null pointer exception, but not knowing which argument to the function caused it is worse.
 

Appendix

Structured Logging

(This article was going to add more info on incorporating a GIID into a Structured Logging framework. This section is here for reference.)

Structured Logging is a type of app log file that is data based rather than prose based. Thus, it is machine readable and amenable to high-level tools, not just a text editor.

Treating logs as data gives us greater insight into the operational activity of the systems we build. Structured logging, which is using a consistent, predetermined message format containing semantic information, builds on this technique …. We recommend adopting structured logging because the benefits outweigh the minimal effort involved and the practice is becoming the default standard. — http://www.thoughtworks.com/radar/techniques/structured-logging

 
An example is a system where the log output uses a predetermined message format. An overview of such systems is found in chapter 5 of “Common Event Expression”, http://cee.mitre.org/docs/Common_Event_Expression_White_Paper_June_2008.pdf

Note this should not be confused with a similar sounding technology called “Log-structured file system”.
 

Further reading

  1. Log management and intelligence, http://en.wikipedia.org/wiki/Log_management_and_intelligence
  2. Logging a global ID in multiple components, http://stackoverflow.com/questions/1701493/logging-a-global-id-in-multiple-components
  3. Application Service Management (ASM) system
  4. Application performance management, http://en.wikipedia.org/wiki/Application_performance_management
  5. The art of application logging, http://www.victor-gartvich.com/2012/05/art-of-application-logging.html
  6. Patterns For Logging Diagnostic Messages, http://c2.com/cgi/wiki?PatternsForLoggingDiagnosticMessages
  7. UUID, UUID
  8. How to test valid UUID/GUID?
  9. Log Data as a Valuable Tool in the DevOps Lifecycle (and Beyond), http://devops.com/features/log-data-valuable-tool-devops-lifecycle-beyond/
  10. OWASP – Logging Cheat Sheet, https://www.owasp.org/index.php/Logging_Cheat_Sheet
  11. How to Do Application Logging Right, http://arctecgroup.net/pdf/howtoapplogging.pdf
  12. Request for comment Structured Logging, http://www.mediawiki.org/wiki/Requests_for_comment/Structured_logging
  13. 6 – Logging What You Mean: Using the Semantic Logging Application Block, http://msdn.microsoft.com/en-us/library/dn440729(v=pandp.60).aspx
  14. A Review of Event Formats as Enablers of event-driven BPM, http://udoo.uni-muenster.de/downloads/publications/2526.pdf
  15. Basic Android Debugging with Logs, http://www.androiddesignpatterns.com/2012/05/intro-to-android-debug-logging.html
  16. Mapped diagnostic context vs Nested diagnostic context, http://wiki.apache.org/logging-log4j/NDCvsMDC
  17. Building Secure Applications: Consistent Logging, http://www.symantec.com/connect/articles/building-secure-applications-consistent-logging
  18. Log for machines in JSON, https://journal.paul.querna.org/articles/2011/12/26/log-for-machines-in-json/
  19. Logging Discussion, http://c2.com/cgi/wiki?LoggingDiscussion
  20. CEE, http://cee.mitre.org/docs/Common_Event_Expression_White_Paper_June_2008.pdf
  21. CEE is a Failure., https://gist.github.com/jordansissel/1983121
  22. Centralized Logging Architecture, http://jasonwilder.com/blog/2013/07/16/centralized-logging-architecture/
  23. Centralized Logging, http://jasonwilder.com/blog/2012/01/03/centralized-logging/
  24. Logging and the utility of message patterns, http://calliopesounds.blogspot.com/2014/07/the-power-of-javatextmessageformat.html?m=1
  25. Payment Application Data Security Standard, https://www.pcisecuritystandards.org/documents/pa-dss_v2.pdf
    Payment application must facilitate centralized logging.
    Note: Examples of this functionality may include, but are not limited to:
    • Logging via industry standard log file mechanisms such as Common Log File System (CLFS), syslog, delimited text, etc.
    • Providing functionality and documentation to convert the application’s proprietary log format into industry standard log formats suitable for prompt, centralized logging.
    Aligns with PCI DSS Requirement 10.5.3
  26. NIST 800-92 “Guide to Computer Security Log Management”, http://csrc.nist.gov/publications/nistpubs/800-92/SP800-92.pdf
  27. UniqueID Generator: A Pattern for Primary Key Generation, http://java.sys-con.com/node/36169
  28. java.util.logging, http://docs.oracle.com/javase/7/docs/api/java/util/logging/package-summary.html

Jenkins CI Server is great

March 16, 2012

Finally got a Jenkins server installed. Had a host of system issues, like communicating to our source code repo.

Jenkins is a joy to use. Well, it is not perfect, what is? Like, I need to pass the name of the user who invoked a build via Jenkins to the target DOS script (yea, Windows) that eventually invokes the legacy Ant scripts. A quick Google search shows that this is asked in various ways, but with no answers. For example, here or here. Hmmmm.

Anyway, now comes a trial use, to see if it is what we really need and can we manage it to do what we will want. With 400 plugins, I don’t see how it could lack. Plus, I’m sure I can use the Groovy plugin to cobble something up. Jenkins even includes a Groovy Console. Finally, there is a road map for possible migration of legacy Ant scripts to Gradle using the Gradle Plugin.

I take back my past snarky comment. Jenkins is not just a pretty face on Cron.

BTW, is there some Wiki law that says a wiki shall never ever have a link to the parent project? If you get tossed into a wiki by following a link, invariably you will click in agony at links that should go to the real home. Instead, you have to edit the URL in the address bar. Since I never curse, I can’t write “wtf”.

Off Topic
Was watching the Easyb intro video. BDD is interesting. Definitely “should” is better than “test”. With so many great tools, why are products still bug ridden?

More stuff

  1. Jenkins home page
  2. Continuous integration Not a very good Wikipedia article
  3. Continuous Integration Much better
  4. Continuous Integration in Agile Software Development
  5. Hooking into the Jenkins(Hudson) API
  6. Five Cool Things You Can Do With Groovy Scripts
  7. Parameterized Builds in Jenkins – choosing subversion folders
  8. Groovy Console
  9. Groovy plugin
  10. Switching to Jenkins–Download and Install Artifact Script for Tester
  11. Gradle Plugin

Virtual Machine Appliance for development environment

January 1, 2012

Configuration of a development environment can be very time consuming, error prone, or difficult. This is especially true when investigating or getting up to speed on a new technology or framework. In a corporate environment this is also a drain on resources and on existing developer staff, who must take the time to prep a new developer.

One approach to mitigate this is to use a Virtual Appliance.

Virtual appliances are a subset of the broader class of software appliances. Installation of a software appliance on a virtual machine creates a virtual appliance. Like software appliances, virtual appliances are intended to eliminate the installation, configuration and maintenance costs associated with running complex stacks of software.

A virtual appliance is not a complete virtual machine platform, but rather a software image containing a software stack designed to run on a virtual machine platform which may be a Type 1 or Type 2 hypervisor. Like a physical computer, a hypervisor is merely a platform for running an operating system environment and does not provide application software itself. — Virtual Appliance

Creating a Virtual Machine Appliance
The available VM software, such as Oracle VirtualBox and VMware, has facilities to generate appliances. Thus, when a functioning development environment is created by a lead tech or group, an appliance can be generated for the rest of the team. This appliance can even be provided using a Virtual Desktop Infrastructure (VDI).

Open Virtualization Format
While a VM system can be used to create individual VM instances that can be reused, a more recent technology (supported by some vendors) is the use of OVF:

… is an open standard for packaging and distributing virtual appliances or more generally software to be run in virtual machines.

The standard describes an “open, secure, portable, efficient and extensible format for the packaging and distribution of software to be run in virtual machines”. The OVF standard is not tied to any particular hypervisor or processor architecture. The unit of packaging and distribution is a so called OVF Package which may contain one or more virtual systems each of which can be deployed to a virtual machine.

An OVF package consists of several files, placed in one directory. A one-file alternative is the OVA package, which is a TAR file with the OVF directory inside. — http://en.wikipedia.org/wiki/Open_Virtualization_Format

Using ready made appliances
Each VM vendor can/does make available an appliance marketplace. Thus, one can find ready-made LAMP based environments with a development software stack, for example.

Alternative 1, an installable virtual disk
Where resources are constrained, such as places where developers are still on 3GB of ram and ancient PCs, a Virtual Machine is just not going to cut it.

One easy alternative is to create a dev environment on an installable soft hard drive. TrueCrypt can be used for this purpose. One simply creates a TrueCrypt volume, which is just a single file, then creates the desired dev env in that volume; that file can now be copied to other devs’ workstations and loaded as a new hard drive.

TrueCrypt is really intended for security and privacy concerns, since it encrypts data, so it may not be ideal for this application. Since TrueCrypt is so useful as a virtual disk, it would be great if it had the option of not encrypting content. But that would perhaps be outside of its feature space. For that, the next alternative is available.

Alternative 2, use VHD files
An alternative is using something directly targeted at virtual disks such as the VHD file format. However, this does not seem to have easily useful public gui or command support (for the end user: developer).

On Windows following the instructions here and using these Send To scripts will allow one to seamlessly use vhd files as mountable hard disk volumes.

Note that Windows 8 will support native mounting of ISO and VHD files.


Hudson/Jenkins CI Server, can’t edit a job?

August 3, 2011

I was looking at a possible use of a Continuous Integration Server to quickly set up a build process.

Downloaded the Jenkins war file, put into Tomcat and defined a simple Job that invoked an Ant file to echo “Hello World!”. Cool, that was easy. But, then I wanted to expand that job to do more. Could not find a modify or edit capability. Huh? What’s up with that?

I searched and found very little. There was even some mention of using SED to edit the Job configuration XML, yeeech! Edit using a text tool for a tree-based data structure?

Anyway, not impressed. Of course, this was a quick tryout. Or maybe Linux people are so perfect they never have to edit their work. :)

Is Jenkins/Hudson just a pretty face on *nix utilities?

I looked at a few other CI Servers. So far Pulse and Team City look interesting, but they are not free.

Updates
Mar 2, 2012: Used one of the latest Jenkins version. Much much better! Though I’m having issues getting Active Directory authentication going. Can log in ok, but then it uses the wrong user “name” that our PCs must use. You know how LDAP has all this distinguished this or that.

Links

  1. Jenkins
  2. Active Directory Plugin

JSON configuration file format

May 8, 2011

JSON is a data interchange format. Should it also be used as a configuration file format, a JSON-CF?

Overview

Had to write yet another properties file for configuration info. Started to think that maybe there are better alternatives. Wondered about JSON for this.

Requirements

What are requirements of a configuration file format?

  • Simple
  • Human readable
  • Cross platform
  • Multi-language support
  • Unicode support

Looks like JSON has all the right qualities.

If all you want to pass around are atomic values or lists or hashes of atomic values, JSON has many of the advantages of XML: it’s straightforwardly usable over the Internet, supports a wide variety of applications, it’s easy to write programs to process JSON, it has few optional features, it’s human-legible and reasonably clear, its design is formal and concise, JSON documents are easy to create, and it uses Unicode.
— Norman Walsh, Deprecating XML

JSON-CF Limitations

  • Instead of angle brackets as in XML, we have quotation marks everywhere.

What does it need?

  • Inline comments, see for example, json-comments
  • Interpolation (property expansion)
  • Namespaces
  • Inheritance
  • Includes
  • Date value
  • Schema
  • Cascading

Example

{
    "_HEADER":{
            "modified":"1 April 2001",
            "dc:author": "John Doe"
    },
    "logger_root":{
            "level":"NOTSET",
            "handlers":"hand01"
     },
    "logger_parser":{
            "level":"DEBUG",
            "handlers":"hand01",
            "propagate":"1",
            "qualname":"compiler.parser"
    },
    "owner":{
             "name":"John Doe",
             "organization":"Acme Widgets Inc."
     },
     "database":{
             "server":"192.0.2.62",     
             "_comment_server":"use IP address in case network name resolution is not working",
             "port":"143",
             "file":"payroll.dat"
      }
}

Programmatic Access using Groovy

Now we can easily read this file in Java. Using Groovy is much easier, of course. Groovy version 1.8 has built-in JSON support; there is a great blog post on this here.

import groovy.json.*;

def result = new JsonSlurper().
                          parseText(new File("config.json").text)

result.each{ section ->
	println "$section\n"
}

>groovy readConfig.groovy
Resulting in a Groovy-style data structure, GRON (look ma, no quotation marks):

logger_parser={qualname=compiler.parser, level=DEBUG, propagate=1, handlers=hand01}

owner={organization=Acme Widgets Inc., name=John Doe}

_HEADER={dc:author=John Doe, modified=1 April 2001}

database={port=143, file=payroll.dat, server=192.0.2.62, _comment_server=use IP address in case network name resolution is not working}

logger_root={level=NOTSET, handlers=hand01}

In Groovy you can access the data with GPath expressions:
println "server: " + result.database.server

You can also pretty print JSON, for example:
println JsonOutput.prettyPrint(new File("config.json").text)

Summary

Raised the question of the use of JSON as a configuration file format.

What I don’t like is the excess quotation marks. YAML is more attractive in this sense. But, the indentation as structure in YAML, similar to Python, may not be wise in configuration files.

Well, what is the answer, should there be a JSON-CF? I don’t know. A very perceptive caution is given by Dare Obasanjo commenting on use of new techniques in general:

So next time you’re evaluating a technology that is being much hyped by the web development blogosphere, take a look to see whether the fundamental assumptions that led to the creation of the technology actually generalize to your use case.

Updates

  • After writing and posting this I searched the topic and, of course, found that this is not a new question. I updated the reading list below with some useful links.
  • JSON Activity Streams are an example of how JSON is used in new ways.
  • schema.org types and properties as RDFS in the JSON format: schema.rdfs.org/all.json
  • Just learned that node.js uses the NPM package manager which uses a JSON config file format.
  • Jan 7, 2012: Java JSR 353: Java API for JSON Processing

Further Reading
A very simple data file metaformat
JSON configuration file format
GRON
Cascading Configuration Pattern
NPM configuration file format
XML or YAML for configuration files
Using JSON for Language-independent Configuration Files
INI file
JSON+Comments
Comparison of data serialization formats
Data File Formats, in Art of Unix Programming, Eric Steven Raymond.
ConfigParser – Work with configuration files
JSON
java.util.Properties
RFC 4627
XML-SW, a skunkworks project by Tim Bray that brings a bunch of the XML complex together into one spec.
YAML
ISO 8601
Learning from our Mistakes: The Failure of OpenID, AtomPub and XML on the Web
Groovy 1.8 Introduces Groovy to JSON
JSON-LD
JSON Schema

 


"Sacred Place", R. Towner / P. Fresu, live in Innsbruck, Part 4


Easy stream parsing using Groovy, CVS example

March 22, 2011

You use every combination of options but that damn command won’t give you what you want?

I faced this last week at work. I had to get a list of my commits to CVS. I tried a bunch of stuff and also searched for a solution. None really worked well. An example of an approach is shown here: “how to search cvs comment history”.

The root problem is that the output of many tools is not always easily reusable. In this situation (and I’m sure in more modern tools like Subversion, Git, or Mercurial) the output resembles the following (I took out work-related info):

=============================================================================
RCS file: /cvs/A...
Working file: Java So..
head: 1.1
branch:
locks: strict
access list:
keyword substitution: kv
total revisions: 4;     selected revisions: 3
description:
----------------------------
revision 1.1
date: 2011/03/  
filename: Produc...tsA
branches:  1....;
file Produ...
----------------------------
revision 1.1.4.1
date: 201....
filename: ProductsA....;
AS.....
----------------------------
revision 1.1.2.1
date: 2011/0
filename: ProductsA....;
ExampleNightMare - ....
=============================================================================

RCS file: /cvs/Am...
Working file: Java S..
head: 1.1
branch:
locks: strict
access list:
keyword substitution: kv
total revisions: 4;     selected revisions: 3
description:
----------------------------
revision 1.1
date: 2011/03/  
filename: Pro...

This output goes on for thousands of lines! Sure, if you use a tool often and have dug into its idioms, or have a guru nearby, you could probably get what you want, but … (off topic, but why don’t man pages and other docs give examples for every option?).

There is no need to take out the dragon book and start writing a parser (is ‘parser’ the correct term in this context?), or even create a DSL. If you’re very familiar with real scripting languages like Python, Perl, or even pure shell utilities, this is easy. If you’re not, are on Windows (and don’t use PowerShell), or just want another approach, Groovy is easy to use.

The usual pattern, I would imagine, is to read the input and trigger on a start phrase that indicates a block of interest; the data is then captured from the lines that follow. However, in my situation depicted above, I did the opposite: I captured the data I needed, but only printed it out when I got a subsequent trigger phrase, the commit comment.

Sure you could generalize or find some tool that does this, but you’d probably spend more time learning the tool or creating a reusable system that only you need or understand.

// file: ParseCvsLog_1.groovy
// Author: jbetancourt

def inside = false
def workingFile

new BufferedReader(new InputStreamReader(System.in)).eachLine(){ s ->
	
	if(s.startsWith("Working file:")){
		inside = true
		workingFile = s.split("Working file:")[1] // capture the file name; is it one of my commits?
	}

	// a matching commit comment indicates that it is
	def found = s ==~ /.*ExampleNightMare.*/
	if(found){
		println(workingFile)   // send to next pipe
		inside = false
	}  
	
}

Probably not a good example of idiomatic Groovy code, but easy to follow. A Groovy expert could probably do it on one line (I don’t like those smarty one-liners; one week later, you don’t know what you did.).

This is used as (all one line):

cvs inscrutable bunch of gibberish that doesn't answer question | groovy ParseCvsLog_1.groovy > myChanges.txt

Nothing new in this post, of course. The value of any scripting approach is that it is infinitely adaptable. And, when the scripting language is easy to use, the results could even be reusable. Perl, Python, and Ruby, for example, have great facilities for sharing of snippets and modular code solutions. Groovy and other JVM based languages like Scala are beginning to add this capability to Java environments.

Updates

  • 20110323T1906-5: Cleaned up the sample code a little; don’t want to give the wrong impression.
  • 20110402T1702-5: While looking through the book “Groovy in Action” I noticed that section 13.5.3, Inspecting version control, deals with this subject.


Cascading Configuration Pattern

October 24, 2010

Synopsis

Many existing systems load configuration in cascade to reduce the use of duplicate properties and allow fine grained customization. This post expresses this common usage into a design pattern.

Keywords:   SCM, CM, Properties, Groovy, Config, CSS

Content:  Context, Forces, Solution, Consequences, Implementation, Code Example, Related Patterns, Related Intellectual Property, Further Reading

Context

A business system uses property (configuration) files to configure particular environments or subsystems.   Many environments share the same properties and values, however, some are different and crucial.  To avoid missing any properties, all the properties are duplicated in each file, and required differences are changed appropriately. 

For example, if there are 100 properties required and there are 23 possible environments, that is 2300 lines of source to manage.  If there are any duplicated properties that do not vary between environments, then there is an opportunity to simplify the configuration system.  In this case, if we make only a “root” file have the full set of properties,  the total size is given by: 

T = L + npL
where
    T is the total size,
    L is the number of properties,
    n is the number of environment files,
    p is the fraction of each file that is overridden.

Here the value is 560 = 100 + 23 × 0.2 × 100. The reduction over the fully duplicated data (2300 lines) is about 76%.
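As a quick check of the arithmetic, here is an illustrative Java sketch (the class and method names are mine, not part of the pattern):

```java
// Quick check of T = L + n*p*L with the numbers from the text.
public class ConfigSizing {
    static long total(int L, int n, double p) {
        return Math.round(L + n * p * L);
    }

    public static void main(String[] args) {
        long cascaded = total(100, 23, 0.2);   // 100 + 23*0.2*100 = 560
        long duplicated = 100L * 23;           // 2300 lines if fully duplicated
        System.out.println(cascaded);          // prints 560
        System.out.printf("reduction: %.1f%%%n",
                100.0 * (duplicated - cascaded) / duplicated); // prints reduction: 75.7%
    }
}
```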

Forces

There are many issues with having the same properties in multiple locations.  One obvious disadvantage is that adding a new property would require changing multiple files.  The files are also larger and not really cohesive.   Tracking errors is also complicated, especially run-time errors due to configuration settings.  Any solution should not add its own complexity; thus, we should not require new types of configuration files and requirements.

Even when configuration could be isolated or systems modularized, there may be a need for duplicated properties or reassignment to satisfy the software development life cycle (SDLC): a system will be different in dev, test, and production environments.

Solution

A hierarchy of property sources and a system that can load each source, overriding properties set at a prior level, provides a partial solution.  This is the approach already taken in many systems.  For example, in software applications and servers, the final properties used are composed of those taken from various standard locations on a particular operating system host.  In Windows, for example, the environment variables are composed of those found in the System and User name spaces.  Some software tools can show the result of the final configuration after all sources are loaded: in the Mercurial DVCS, hg showconfig shows the combined config settings from all hgrc files; in Git, one executes git config -l.

Many build systems, such as Ant and Maven, allow and expect the cascading of property files.  Note that in Ant the inverse policy is used: the first property assignment wins.

And, of course, this cascading is seen in the Cascading Style Sheet (CSS) technology.

Consequences

There is now the requirement that the cascade order is well known, non-variable, and robust.  Any intervening file must be loaded, else the system may destabilize.  Adding a new property or changing a property will require extra care, since an environment or subsystem may be affected if the file contents are not properly designed beforehand.

To provide management and debugging support the implementation should provide traceability of configuration.  Thus, during use, the system must report (log) what was overridden, not changed, what is missing, etc.

Implementation

Using this pattern is very easy.  One just has to determine how the system handles reassignment in the configuration loader subsystem.  If it allows resetting, then the file hierarchy runs from global to specific.

Where complexity may come is when one wants to satisfy more sophisticated requirements.  For example, a property may need to be appended.  In OS paths, for instance, paths are combined to create the final path.  Thus, the operations on properties could be a combination of:

  • create: initial assignment of a property
  • update:  a new value is assigned
  • delete:  the property is removed
  • append: append to existing value
  • merge:  use a pattern to merge a new value
  • locked: allow no change
  • fail on replace:  if a value already assigned throw an exception.
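As an illustrative sketch, not taken from any particular framework, here is one way those operations might be applied when a later source assigns a key (the merge operation is omitted for brevity; all names here are hypothetical):

```java
// Hypothetical policy-aware assignment for a cascading configuration loader.
import java.io.File;
import java.util.Properties;

public class PolicyLoader {
    enum Policy { CREATE, UPDATE, DELETE, APPEND, LOCKED, FAIL_ON_REPLACE }

    static void apply(Properties props, String key, String value, Policy policy) {
        String existing = props.getProperty(key);
        switch (policy) {
            case DELETE:            // the property is removed
                props.remove(key);
                break;
            case LOCKED:            // allow no change: first assignment sticks (Ant-style)
                if (existing == null) props.setProperty(key, value);
                break;
            case FAIL_ON_REPLACE:   // a duplicate assignment is an error
                if (existing != null)
                    throw new IllegalStateException("duplicate key: " + key);
                props.setProperty(key, value);
                break;
            case APPEND:            // e.g. OS-path style combination
                props.setProperty(key, existing == null
                        ? value : existing + File.pathSeparator + value);
                break;
            default:                // CREATE / UPDATE: last assignment wins
                props.setProperty(key, value);
        }
    }
}
```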

Code Example

In Java, the java.util.Properties class will overwrite an existing property that has the same key.  Thus, to cascade properties, one just reuses the same instantiated Properties object via the various load(…) methods.
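A minimal sketch of that behavior in plain Java (the keys and values here are made up):

```java
// Cascading with plain java.util.Properties: reusing one instance means
// later load(...) calls overwrite earlier values for the same key.
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class CascadeDemo {
    static Properties cascade(String... sources) throws IOException {
        Properties props = new Properties();
        for (String source : sources) {
            props.load(new StringReader(source)); // same instance: later wins
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties p = cascade(
            "host=localhost\nport=8080",   // "root" defaults, loaded first
            "host=prod.example.com"        // environment-specific override
        );
        System.out.println(p.getProperty("host")); // prints prod.example.com
        System.out.println(p.getProperty("port")); // prints 8080
    }
}
```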

Note that in some frameworks, like Ant, the reverse is true, properties do not get overwritten. So, the “root” properties must be loaded last in a cascade sequence.

Shown in listing one is a simple implementation written in Groovy, a dynamic language on the JVM.  The CascadedProperties class extends Properties and adds methods to load a list of sources.  The first source in the list is the root of the “cascade”.  In addition, a few methods explicitly specify the root source. From a traceability point of view, reusing the java.util.Properties class may not be optimal: it gives no access to low-level capture of the actual “put” action. Even with the use of Aspect-Oriented Programming, with for example the AspectJ language, there is no joinpoint available to add it.

listing 1

/**
 * File: CascadedProperties.groovy
 * Date: 23OCT10T18:13-05
 * Author: JBetancourt
 */

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;

/**
 * An extension of Properties that adds Convenience
 * methods to load lists of sources.
 *
 * @author jbetancourt
 */
class CascadedProperties extends Properties {
	//private Properties rootProperties = new Properties();
	//private boolean firstWins = true;
	//private boolean failOnDuplicate = false;
	//private boolean isTrace = false;

	/**
	 * Load a list of properties sources.
	 * @param list
	 */
	public void load(List list) throws IOException, IllegalArgumentException {
		list.each {
			load(it)
		}
	}

	/**
	 * Explicit file path is specified.
	 * @param path
	 */
	public void load(String path) throws IOException, IllegalArgumentException {
		load(new File(path).newInputStream());
	}

	/**
	 * A load method that explicitly specifies the "default" source in
	 * the cascade order.
	 *
	 * @param inStream
	 * @param list
	 */
	public void load(InputStream inStream, List list) throws IOException, IllegalArgumentException {
		load inStream
		load list
	}

	/**
	 * A load method that explicitly specifies the "default" source in
	 * the cascade order.
	 *
	 * @param reader
	 * @param list
	 */
	public void load(Reader reader, List list) throws IOException, IllegalArgumentException {
		load reader
		load list
	}

	/**
	 * A load method that explicitly specifies the "default" source in
	 * the cascade order.
	 *
	 * @param path
	 * @param list
	 */
	public void load(String path, List list) throws IOException, IllegalArgumentException {
		load path
		load list
	}

} // end of CascadedProperties

In listing two, the JUnit test class is shown.
listing 2

/**
 * File: CascadedPropertiesTest.groovy
 * Date: 23OCT10T18:13-05
 * Author: JBetancourt
 */

import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import org.junit.Before;
import org.junit.Test;

import groovy.util.GroovyTestCase;

/**
 * Test the {@link CascadedProperties} class.
 */
class CascadedPropertiesTest extends GroovyTestCase{
	private CascadedProperties cp;

	/** executed before each test method run */
	public void setUp() throws Exception {
		cp = new CascadedProperties();
	}

	public void testloadListPaths() throws Exception {
		List list = new ArrayList();
		list.add path1
		list.add path2

		cp.load(list);

		assertEquals("v2",cp.get("k1"));
	}

	public void testloadListReaders() throws Exception {
		List list = new ArrayList();
		list.add reader1
		list.add reader2

		cp.load(list);

		assertEquals("v2",cp.get("k1"));
	}

	public void testloadListStreams() throws Exception {
		List list = new ArrayList();
		list.add iStream1
		list.add iStream2

		cp.load(list);

		assertEquals("v2",cp.get("k1"));
	}

	public void testloadStreamAndListStreams() throws Exception {
		List list = new ArrayList();
		list.add iStream2
		list.add iStream3

		cp.load(iStream1,list);

		assertEquals("v3",cp.get("k1"));
	}

	public void testloadPathAndListStreams() throws Exception {
		List list = new ArrayList();
		list.add iStream2
		list.add iStream3

		cp.load("data\\file1.properties",list);

		assertEquals("v3",cp.get("k1"));
	}

	public void testloadReaderAndListStreams() throws Exception {
		List list = new ArrayList();
		list.add reader2
		list.add reader3

		cp.load(reader1,list);

		assertEquals("v3",cp.get("k1"));
	}

	public void testPutAgain() {
		cp.put("k1", "v1")
		cp.put("k1", "v2")
		assertEquals(cp.get("k1"), "v2");
	}

	public void testLoadOneFilePath() throws Exception {
		cp.load("data\\file1.properties");
		assertEquals("v1",cp.get("k1"));
	}

	public void testLoadTwoFiles() throws Exception {
		cp.load(iStream1)
		cp.load(iStream2)

		assertEquals("v2",cp.get("k1"));
	}

	//
	// class fields
	//
	String path1 = "data\\file1.properties"
	String path2 = "data\\file2.properties"
	String path3 = "data\\file3.properties"
	File file1 = new File(path1)
	File file2 = new File(path2)
	File file3 = new File(path3)

	InputStream iStream1 = file1.newInputStream()
	InputStream iStream2 = file2.newInputStream()
	InputStream iStream3 = file3.newInputStream()
	Reader reader1 = file1.newReader()
	Reader reader2 = file2.newReader()
	Reader reader3 = file3.newReader()

}

Property Files Analysis

It is very difficult to eyeball a set of property files to see if there are many duplicates. I guess one could write some magic one-line script that concatenates, sorts, prunes, etc. Here is an alternative using the Groovy language.
listing 3

/*
 * Script: AnalyizePropertyFiles.groovy
 * Author: J. Betancourt
 */
import java.util.Hashtable;
import java.util.Map.Entry;

/**
 * @author jbetancourt
 */
class AnalyizePropertyFiles {
	// these could have been created dynamically by reading the target folder.  But, that may pick up non-used files.
	static fileNames = []

	// where are the files?
	static basePath = ""

	// run analysis....
	static main(args) {
		def propList = [] // each property file will be load into a cell which contains a properties object.

		// put each file into properties object in propList.
		fileNames.each {
			def p = new Properties()
			p.load(new File(basePath + "\\" + it).newInputStream())
			propList.add(p)
		}

		// put all properties in one allProps object.
		Properties allProps = new Properties();
		propList.each {
			allProps.putAll(it)
		}// each propList
		
		def result = []
		
		int totalCount = 0
		
		// get how many times each property is used in all properties object
		allProps.each { prop ->
			String key = prop.key
			String value = prop.value
			def values = []
			int count = 0
			propList.each { p ->
				if(p.containsKey(key)){
					count++
					def curValue = p.get(key)
					if(!values.contains(curValue)){
						values.add(curValue)
					}								
				}
			} // each property file
				
			StringBuilder vb = new StringBuilder()
			values.each{ s ->
				vb.append("[" + s + "], ")
			}
			
			int numberValues = values.size()

			StringBuilder sb = new StringBuilder()
			sb.append(count).append(",").append(numberValues).append(", ").append(key).append(", ").append(vb.toString())
			result.add(sb.toString())	
			
			totalCount += count				
			
		} // each allProps property
		
		
		result.each{
			println(it)
		}
		
		println("Total number of times properties are repeated: " + totalCount)
		
	} // end main

} // end class

Related Patterns

Related Intellectual Property

“Cascading configuration using one or more configuration trees”, U.S. Patent number 7760746, 30Nov2004, http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN%2F7760746


Further Reading


Thelonious Monk – ‘Round Midnight – 1966

