All modern J2EE applications are distributed in nature and typically involve web servers, application servers, database servers and so on. Diagnosing application problems in a distributed environment is challenging. Tracing problems as application requests and transactions traverse the distributed components, whether the web server, the application server or the database, is always time-consuming and requires quite a bit of application knowledge. A number of monitoring tools have promised the holy grail of transaction tracking and diagnosis but have largely failed. Many of them end up as large dashboards with red, yellow and green markers that indicate only whether the system is running fine or has issues, without enough information to diagnose those issues quickly. Enterprises spend millions in licensing costs to monitor these distributed components, only to spend more money on additional resources to diagnose and fix the problems. Most monitoring tools end up with operational teams that lack the deep knowledge needed to diagnose application problems quickly.
My years of experience diagnosing problems have taught me that to diagnose problems quickly one has to resort to application logs, stack traces, debug traces and some deep knowledge of the application. Deep application monitoring tools such as Wily and ITCAM for Application Diagnostics do offer deep instrumentation but carry very high performance overheads when they are left turned on all the time. Despite their deep tracing capabilities, they still fall short of correlating individual requests to provide meaningful insight for resolving problems quickly.
All application components have some kind of log, but obtaining the logs from disparate systems and correlating their entries for application diagnosis is a time-consuming process.
The Splunk log monitoring tool helps you search log files spread across disparate systems and organizes the results chronologically, by host, by log type and so on, but by itself it is not sufficient to help you correlate the log entries. Splunk offers a slightly different approach to application problem monitoring by helping you scan application logs for error symptoms. However, Splunk does require some prior knowledge of the application's exception handling and of the common error strings to look for. While Splunk is extremely powerful at scanning logs and displaying search results chronologically across the various distributed components, it still falls short of correlating them unless the application provides some unique identifier to tie the logs together. Building a logging framework that can be shared across the distributed application will help application diagnosis immensely.
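To make the idea of a unique identifier concrete, here is a minimal sketch in plain Java. It is not the implementation from this series; the class name `CorrelationContext`, the `requestId` and `component` field names and the `key=value` layout are all assumptions chosen for illustration. The sketch reuses an identifier handed over by an upstream component when there is one, generates one otherwise, and stamps it on every log line written by the current thread.

```java
import java.util.UUID;

/** Hypothetical helper (illustrative only): carries one request id per thread. */
final class CorrelationContext {
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    private CorrelationContext() {}

    /** Reuse the id passed in by an upstream component, or generate a new one. */
    static String begin(String incomingId) {
        String id = (incomingId == null || incomingId.trim().isEmpty())
                ? UUID.randomUUID().toString()
                : incomingId.trim();
        REQUEST_ID.set(id);
        return id;
    }

    /** Prefix a message with key=value pairs so every entry carries the same id. */
    static String format(String component, String message) {
        String id = REQUEST_ID.get();
        return "requestId=" + (id == null ? "none" : id)
                + " component=" + component + " " + message;
    }

    /** Clear the id at the end of the request so pooled threads do not leak it. */
    static void end() {
        REQUEST_ID.remove();
    }
}
```

If every tier writes its entries through something like `CorrelationContext.format("web", "...")` and forwards the identifier to the next tier (for example in a request header, which is again an assumption about your setup), a single log search on that identifier should bring back the entries for one request across all components.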
All modern application frameworks provide some kind of mechanism to add filters to application requests, or you can leverage aspect-oriented programming patterns to add contextual information as requests pass through the different layers of your application. Caching application traces for requests and printing them when a request exceeds a response time threshold can greatly aid in application diagnostics.
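The caching idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `RequestTraceBuffer`, its methods and the summary line format are hypothetical, and a real deployment would call `begin()` and `end()` from a request filter or an aspect rather than by hand. Traces are held in memory per thread and written to the log sink only when the request turns out to be slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical per-request trace buffer (illustrative only). */
final class RequestTraceBuffer {
    private final long thresholdMillis;
    private final Consumer<String> sink; // where flushed traces go, e.g. a logger
    private final ThreadLocal<List<String>> entries = ThreadLocal.withInitial(ArrayList::new);
    private final ThreadLocal<Long> startNanos = new ThreadLocal<>();

    RequestTraceBuffer(long thresholdMillis, Consumer<String> sink) {
        this.thresholdMillis = thresholdMillis;
        this.sink = sink;
    }

    /** Call when the request enters the application (filter or aspect). */
    public void begin() {
        entries.get().clear();
        startNanos.set(System.nanoTime());
    }

    /** Cache a trace entry in memory; nothing is written to the log yet. */
    public void trace(String message) {
        entries.get().add(message);
    }

    /** Call when the request completes; uses the measured wall-clock time. */
    public boolean end() {
        Long start = startNanos.get();
        long elapsed = (start == null) ? 0L : (System.nanoTime() - start) / 1_000_000L;
        return end(elapsed);
    }

    /** Flush cached traces only if the request was slower than the threshold. */
    public boolean end(long elapsedMillis) {
        List<String> buffered = entries.get();
        boolean slow = elapsedMillis > thresholdMillis;
        if (slow) {
            sink.accept("SLOW request elapsedMs=" + elapsedMillis
                    + " thresholdMs=" + thresholdMillis);
            buffered.forEach(sink);
        }
        entries.remove();   // drop cached traces either way
        startNanos.remove();
        return slow;
    }
}
```

A buffer created as `new RequestTraceBuffer(2000, System.out::println)` would stay silent for requests that finish within two seconds and print the summary line followed by every cached trace for those that do not. The trade-off is memory: each in-flight request holds its traces until it completes, so very chatty tracing may need a cap on entries.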
I will be talking about such an implementation in Part 2 of this series, one that leverages the Splunk deployment in our application infrastructure to enhance the performance and application diagnosis process.