Building Large Scale Splunk Environments? Automation and DevOps are a must.

Introduction

Splunk is a log aggregation, analysis, and automation platform used by small and large enterprises to provide visibility into computing operations and to serve as a Security Information and Event Management (SIEM) platform. It is a very mature product, with deep penetration in financial services firms. The Elasticsearch-Logstash-Kibana (ELK) stack is an open-source suite with a similar feature set.

For a large organization, log aggregation platforms can easily end up ingesting terabytes of data. Both Splunk and ELK require incoming data to be indexed and rely on distributed storage to provide scale. In practice, this means dozens, or even hundreds of individual nodes operating in a cluster. Once an organization moves past a 2-3 node log-aggregation cluster, a strong automation framework is essential for testing and deploying all components of the log aggregation platform. 

Risk Focus has built large analytics clusters for a number of clients in recent years, including those focused on financial and healthcare data analysis. Several engagements have involved building large Splunk clusters. The automation challenges for log aggregation platforms are similar in many respects to those of other analytics platforms. In almost all cases the three major challenges are: 

  • Managing data ingestion/archiving 
  • Configuring and scaling compute clusters required for analysis  
  • Maintaining consistent configuration across the entire cluster, backed by robust testing 

Infrastructure automation

Based on our experience, Splunk clusters containing more than a handful of nodes should not be built or configured manually. Automation delivers configuration consistency and prevents configuration drift. It also improves efficiency in resource deployment and utilization. This is true for the base system configuration as well as for cluster deployment and configuration tasks. 

Organizations using current generation technology can deploy standardized base operating system images to virtual machines, speeding up initial infrastructure deployment significantly. An automation and configuration management tool (such as Salt or Ansible) can then be utilized to deploy software and customized configuration onto each node within the network. Cloud orchestration technology (e.g., Terraform or AWS CloudFormation) may be utilized alongside configuration management tools in particularly large environments. As an aside, AWS has a managed Elasticsearch/ELK offering. For organizations considering cloud deployments, there is no need to re-invent the wheel: the AWS offering does virtually everything an organization might want in terms of infrastructure automation, including multi-AZ deployments for high availability.
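
To make this concrete, the sketch below shows one way such a deployment step might look in Python with boto3: launching Splunk nodes from a standardized base image and tagging them so a configuration management tool can finish the role-specific setup on first boot. The AMI, subnet, security group, and tag names are placeholders, not a prescribed layout.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_splunk_nodes(count, role):
    """Launch `count` nodes from a hardened, standardized base image.
    The AMI, subnet, and security group IDs below are placeholders."""
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",            # golden base image (placeholder)
        InstanceType="c5.4xlarge",
        MinCount=count,
        MaxCount=count,
        SubnetId="subnet-0123456789abcdef0",        # placeholder
        SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder
        TagSpecifications=[{
            "ResourceType": "instance",
            # The Role tag lets Salt/Ansible decide which Splunk role
            # (indexer, search head, forwarder) to configure on the node.
            "Tags": [{"Key": "Role", "Value": role},
                     {"Key": "Cluster", "Value": "splunk-prod"}],
        }],
    )
    return [i["InstanceId"] for i in response["Instances"]]

if __name__ == "__main__":
    print(launch_splunk_nodes(3, "indexer"))
```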

Utilizing such automation frameworks makes it extremely easy to scale the environment up (and in certain cases down). It also simplifies other common management tasks critical for operational stability, including: 

  • Adding search-heads or indexers: This is easier with a well-understood, automated deployment process that is not subject to manual error (see the sketch after this list). 
  • Disaster Recovery: This is easier and less costly to accommodate when compute resources can be stood up quickly and confidently with automated procedures. 
  • Resource Efficiency: When search, ingestion, and reporting follow a specific pattern during the day, or different business units require additional capacity, infrastructure automation enables re-scaling of components and redirecting resources towards other tasks/nodes. 
  • Testing/Upgrading/Patching/Re-Configuring software: A consistent and modern DevOps practice is necessary to make modifications to a large cluster reliably and with minimal downtime. 
  • Security and Auditability: Financial services firms, health-care providers, and utilities face high regulatory burdens. Auditors and regulators have an interest in the computational operations of these clients. Splunk/ELK is a good source of data for audits and an indicator of mature operational management and monitoring practices within the information technology organization. As an organization begins to rely more on Splunk for both security monitoring and operational visibility, it can expect auditors and regulators to treat Splunk as critical infrastructure and take a greater interest in the Splunk environment itself. Employing automation and consistent DevOps practices to build, deploy and manage the Splunk environment goes a long way towards allaying regulatory concerns. But more importantly, to ensure operational stability, large Splunk clusters (or any other large-scale analytical/compute environment) should use automation for initial deployments and to manage configuration drift across the fleet. 
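
As a small example of the first point above, once new indexers have been added through automation, a short script against Splunk's REST management API (port 8089) can confirm that every peer has joined the cluster and is healthy. This is only a sketch: the host, credentials, and the exact field names checked are assumptions and may vary between Splunk versions.

```python
import requests

# Placeholders: the cluster master host and credentials would normally come
# from a secrets store rather than being hard-coded.
CLUSTER_MASTER = "https://splunk-cm.example.com:8089"
AUTH = ("admin", "changeme")

def check_cluster_peers(expected_count):
    """List indexer cluster peers on the cluster master and flag problems."""
    resp = requests.get(
        f"{CLUSTER_MASTER}/services/cluster/master/peers",
        params={"output_mode": "json"},
        auth=AUTH,
        verify=False,  # the management port often uses a self-signed certificate
    )
    resp.raise_for_status()
    peers = resp.json()["entry"]

    if len(peers) != expected_count:
        print(f"Expected {expected_count} peers, found {len(peers)}")

    for peer in peers:
        status = peer["content"].get("status")  # e.g. "Up", "Down", "Pending"
        if status != "Up":
            print(f"Peer {peer['name']} reports status {status}")

if __name__ == "__main__":
    check_cluster_peers(expected_count=12)
```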

Splunk Administration Tools

Splunk has a rich toolset of GUI and CLI tools to manage configurations for indexers, forwarders, license managers and other components in a Splunk cluster. Most firms are likely to have some form of automation/DevOps standard across the organization, in aspiration if not yet in full practice. A key part of planning for a large Splunk environment is defining the boundary between standardized, cross-platform configuration management tools and the areas where the Splunk toolset exercises control. 

There is no perfect answer to where this boundary lies. Risk Focus works with clients who have limited the use of Splunk management tools to managing licenses and performed virtually every other task with scripted automation frameworks. Other clients chose to go in the opposite direction and rely largely on Splunk’s tools to manage and configure the cluster. 

Making the right decision on tooling depends on the practices and skill set of the team expected to support Splunk. Organizations that rely on external service providers (including Splunk PS) to maintain their environment need to consider whether these providers are familiar with their preferred automation or configuration management toolkits. Organizations that prioritize mobility among technology staff will want to place more emphasis on common tooling across all applications, rather than rely on Splunk-specific management tools. 

Application Life Cycle

Testing and Quality Assurance 

An important and often underappreciated part of Splunk environment management is how to deal with the testing and release of software updates and patches, reports, dashboards, and applications. 

Organizations using Splunk for critical business activity should treat Splunk like any other business critical system and follow best practice for software delivery and maintenance. That means establishing processes to test software releases. A release can impact user activity, reports, dashboards, ingestion configurations, and system performance. Building Splunk test environments and efficiently rolling changes through a development/test/deployment lifecycle requires automation. Absent such automation, test cycles become expensive and will not be executed consistently. Mature organizations using a Continuous Integration/Continuous Deployment (CI/CD) lifecycle will find that the effort expended to integrate Splunk into their CI/CD pipeline delivers enormous rewards over time. 
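
As an illustration of what one such pipeline stage might look like, the sketch below uses the Splunk Python SDK (splunklib) to dispatch the saved search behind a critical dashboard in a test environment and assert that it still returns results. The host, credentials, and saved-search name are placeholders, and a real pipeline would add polling timeouts and result validation.

```python
import splunklib.client as client
import splunklib.results as results

# Placeholders: in a real pipeline these come from CI secrets and point at the
# Splunk *test* environment, never production.
service = client.connect(
    host="splunk-test.example.com", port=8089,
    username="ci_user", password="ci_password")

def test_critical_dashboard_search():
    """Dispatch the saved search behind a critical dashboard and verify that
    it still parses and returns events against the test data set."""
    saved = service.saved_searches["critical_ops_dashboard_search"]  # placeholder name
    job = saved.dispatch()
    while not job.is_done():
        pass  # a real test would sleep between polls and enforce a timeout

    rows = [r for r in results.ResultsReader(job.results()) if isinstance(r, dict)]
    assert len(rows) > 0, "critical search returned no results against test data"
```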

Release Management 

In managing releases and updates to Splunk configuration, it is useful to view Splunk in a broader context as a data analytics tool. Most organizations have some experience with data analytics tools such as SAP, SAS, Informatica, or other Business Intelligence platforms. For these systems, organizations have often established fine-grained control over reports and dashboards used in critical business activities. This includes limiting users’ ability to change them in production. Splunk is no different. We advise customers to clearly establish permission and ownership boundaries in their production Splunk environment. These should be balanced so that they do not constrain the Splunk users’ ability to analyze data. 

One way to balance the tension between user freedom and organizational oversight is to create a dividing line between reports/dashboards used in daily operations and ad-hoc analyses. We find that the following best practices are quite useful: 

  • Critical reports and dashboards should be tightly controlled via Splunk’s permission tools (see the sketch after this list) 
  • Changes to critical dashboards, however minor, should be tested 
  • Software upgrades should involve user acceptance testing and automated testing of all such dashboards 
  • Data from test environments should be ingested continuously into a Splunk test environment. Application/infrastructure changes which might impact Splunk should be tested here.  
  • Critical monitoring dashboards should be validated as part of application testing to ensure any changes to software/log format do not impact these dashboards. 
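
To illustrate the first bullet, permissions on a critical knowledge object can be applied programmatically rather than by hand, which keeps the change repeatable and auditable. The sketch below posts to the ACL endpoint of a saved search via Splunk's REST API; the app, object, and role names are placeholders, and the endpoint details should be confirmed against the REST documentation for your Splunk version.

```python
import requests

SEARCH_HEAD = "https://splunk-sh.example.com:8089"
AUTH = ("admin", "changeme")  # placeholder credentials

def lock_down_saved_search(app, name):
    """Share a critical saved search at the app level, readable by everyone
    but writable only by a designated admin role."""
    url = f"{SEARCH_HEAD}/servicesNS/nobody/{app}/saved/searches/{name}/acl"
    resp = requests.post(
        url,
        data={
            "sharing": "app",
            "owner": "nobody",
            "perms.read": "*",
            "perms.write": "splunk_admins",  # placeholder role name
        },
        auth=AUTH,
        verify=False,  # management port frequently uses a self-signed certificate
    )
    resp.raise_for_status()

if __name__ == "__main__":
    lock_down_saved_search("ops_dashboards", "critical_ops_dashboard_search")
```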

Example

Splunk scales well to meet the needs of large enterprises, but the topology can get complex with increased scale. An example of a deployment framework for a multi-tenant Splunk cluster is shown below. Depending on your organization’s needs, automated or semi-automated scaling can be built into the framework.

Conclusion

The current generation of automation tools and DevOps best practices can deliver significant benefits to an organization seeking to manage and maintain a large Splunk cluster. Every organization should carefully consider the benefits of using such tools to manage their environment. Organizations relying on Splunk for critical operational management should treat it as such and build a robust testing framework for their environment. 


    Using the Siren Platform for Superior Business Intelligence

    Can we use a single platform to uncover and visualize interconnections within and across structured and unstructured data sets at scale?

    Objective

    At Risk Focus we are often faced with new problems or requirements that cause us to look at technology and tools outside of what we are familiar with. We frequently engage in short Proof-Of-Concept projects based on a simplified, but relevant, use case in order to familiarize ourselves with the technology and determine its applicability to larger problems.

    For this POC, we wanted to analyze email interactions between users to identify nefarious activity such as insider trading, fraud, or corruption. We identified the Siren platform as a tool that could potentially aid in this endeavor by providing visualizations on top of Elasticsearch (ES) indexes to allow drilling down into the data based on relationships. We also wanted to explore Siren’s ability to define relationships between ES indexes and existing relational databases.

    The Setup

    Getting started with the Siren platform is easy given the dockerized instances provided by Siren and the Getting Started guide. Using the guide, I was able to get an instance of the platform running with pre-populated data quickly. I then followed their demo to interact with Siren and get acquainted with its different features.

    Once I had a basic level of comfort with Siren, I wanted to see how it could be used to identify relationships in emails, such as who communicates with whom and whether anyone circumvents Chinese wall (information barrier) restrictions by communicating through an intermediary. I chose the publicly available Enron email corpus as my test data and indexed it in ES using a simple Java program that I adapted from code I found here. I created one index containing the emails and another index of all the people, identified by email address, who were either senders or recipients of the emails. The resulting data contained half a million emails and more than 80,000 people.
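
    The original loader was a small Java program; purely to illustrate the shape of the two indices, the sketch below shows roughly the same idea in Python with the Elasticsearch client (assuming the 8.x client API). The index names, mappings, and parsed-message format are my own placeholders; note that the address fields are mapped as keyword so they can later be used in Siren relations.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Map the address fields as `keyword` so they can be used in Siren relations.
es.indices.create(index="enron-people", mappings={
    "properties": {"email": {"type": "keyword"}}})
es.indices.create(index="enron-emails", mappings={
    "properties": {
        "sender":     {"type": "keyword"},
        "recipients": {"type": "keyword"},
        "subject":    {"type": "text"},
        "body":       {"type": "text"},
        "date":       {"type": "date"},
    }})

def index_email(msg):
    """`msg` is a dict produced by whatever parses the raw Enron mail files."""
    es.index(index="enron-emails", document={
        "sender": msg["sender"], "recipients": msg["recipients"],
        "subject": msg["subject"], "body": msg["body"], "date": msg["date"]})
    for address in [msg["sender"], *msg["recipients"]]:
        # Use the address itself as the document id so each person appears once.
        es.index(index="enron-people", id=address, document={"email": address})
```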

    With the data in place, I next set up the indices in Siren and defined the relationships between them. The straightforward UI makes this process very simple. The indices are discoverable by name, and all of the fields are made available. After selecting the fields that should be exposed in Siren, and potentially filtering the data, a saved search is created.

    Once all of the indices are loaded and defined, the next step is to define the relationships. There is a beta feature to do this automatically, but it is not difficult to set up manually. Starting with one of the index pattern searches, the Relations tab is used to define the relationships the selected index has to any others that were defined. The fields that are used in the relationships must be keyword types in ES, or primary keys for other data sources.

    Now that the indices are loaded and connected by relationships, the next step is to create some dashboards. From the Discovery section, the initial dashboards can be automatically created with a few common visualizations that are configured based on the data in the index. Each dashboard is linked to an underlying saved search which can then be filtered. There is also a visualization component that allows for filtering one dashboard based on the selection in a related dashboard.

    Dashboard

    Each dashboard is typically associated with a saved search and contains different visualizations based on the search results. Some of the visualizations show an aggregated view of the results, while others provide an alternative way to filter the data further and to view the results. Once the user has identified a subset of data of interest in a particular dashboard, s/he can quickly apply that filter to another related dashboard using the relational navigator widget. For example, one can identify a person of interest on the people dashboard and then click a link on the relational navigator to be redirected to the emails dashboard, which will be filtered to show just the emails that person sent/received.

    The above screenshot shows two versions of the same information. The people who sent the most emails are on the x-axis, and the people they emailed are on the y-axis, with the number of emails sent in the graph. By clicking on the graphs, the data can be filtered to drill down, for example, to emails sent from one person to another.

    Graph Browser

    One of the most interesting features of Siren is the graph browser, which allows one to view search results as a graph with the various relationships shown. It is then possible to add/remove specific nodes, expand a node to its neighboring relationships, and apply lenses to alter the appearance of the graph. The lenses that come with Siren allow for visualizations such as changing the size or color of the nodes based on the value of one of its fields, adding a glyph icon to the node, or changing the labels. Custom lenses can also be developed via scripts.

    In the screenshot above, I started with a single person and then used the aggregated expansion feature to show all the people that person had emailed. The edges represent the number of emails sent. I then took it one step further by expanding each of those new nodes in the same way. The result is a graph showing a subset of people and the communication between them.

    Obstacles

    As this was my first foray into both Elasticsearch and Siren, I faced some difficulty in loading the data into ES in such a way that it would be useful in Siren. In addition, the data was not entirely clean given that some of the email addresses were represented differently between emails even though they were for the same person. There were also many duplicate emails since they were stored in multiple inboxes, but there was no clear ID to link them and thus filter them out.
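
    One hypothetical way to have handled both issues at load time, sketched below, is to normalize addresses to a canonical form and derive each email's document ID from a hash of its identifying fields, so that the same message stored in several mailboxes collapses into a single document. This is an illustration of the cleanup step, not something the POC actually did.

```python
import hashlib

def normalize_address(raw):
    """Collapse variants like 'John.Doe@ENRON.com <john.doe@enron.com>'
    down to a single canonical lower-case address."""
    addr = raw.strip().lower()
    if "<" in addr and ">" in addr:
        addr = addr[addr.index("<") + 1:addr.index(">")]
    return addr

def email_doc_id(sender, recipients, date, body):
    """Hash the fields that identify a message so the same email stored in
    several mailboxes maps to one Elasticsearch document id."""
    canonical = "|".join([
        normalize_address(sender),
        ",".join(sorted(normalize_address(r) for r in recipients)),
        date,
        body.strip(),
    ])
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```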

    Apart from the data issues, I also had some difficulty using Siren. While the initial setup is not too difficult, there are some details that can be easily missed, leading to unexpected results. For example, when I loaded my ES indices into Siren, I did not specify a primary key field. This is required to leverage the aggregated expansion functionality in the graph browser, but I didn’t find anything in the documentation about this. I also experienced some odd behavior when using the graph browser. Eventually, I contacted someone at Siren for assistance. I had a productive call in which I learned that there was a newer patch release that fixed several bugs, especially in the graph browser. They also explained to me how the primary key of the index drove the aggregated expansion list. Finally, I asked how to write custom scripts to create more advanced lenses or expansion logic. Unfortunately, this is not yet documented and not widely used; most people writing scripts currently just modify the packaged ones.

    Final Thoughts

    In the short amount of time that I spent with Siren, I could see how it can be quite useful for linking different data sets to find patterns and relationships. With the provided Siren platform docker image, it was easy to get up and running. As with any new technology, there is a learning curve to fully utilize the platform, but for Kibana users this will be minimal. The documentation is still lacking in some areas, but the team is continuously updating it and is readily available to assist new users with any questions they may have.

    For my test use case, I feel that my data was not optimal for maximizing the benefits of the relational dependencies and navigation, but for another use case or a more robust data set, it could be beneficial. I also did not delve into the monitoring/alerting functionality to see how that could be used with streaming data to detect anomalies in real time, so that could be another interesting use case to investigate in the future.
