Splunk vs. ELK – The Risk Focus Way

Executive Summary

Though Splunk and ELK offer similar features, choosing the right solution for a specific enterprise depends heavily on the use case under consideration and on the organization’s capacity to integrate and build on top of technology platforms.

Splunk offers robust commercial-grade support and a broader set of plugins/apps than ELK. These offerings can lower the effort required to integrate Splunk into an enterprise and use it across a broad set of use cases. For organizations with narrower requirements, ELK offers a compelling solution, with standard plugins and apps supporting numerous data sources. ELK’s open-source nature allows teams capable of developing plugin code to build an excellent customized solution on the ELK platform.

In this paper, we evaluate Splunk and Elastic (ELK) along criteria that are relevant to medium and large-scale organizations. These include ease of adoption, integration into a heterogeneous compute environment, scalability, support for public cloud infrastructure monitoring, performance, total cost of ownership, and features relevant to use cases of value for our clients.

See our companion piece on building and operating Splunk in large-scale environments.

Contents

– Why aggregate logs?
– Executive Summary
– Feature Set Comparison
– Ingestion/Integration Add-Ons/Adapters
– Applications/Pre-built reports
– Performance
– Maintainability and Scaling
– Cloud Offering
– Support
– Ease of Use and Training
– Total Cost of Ownership/Use
– Conclusion

Authors

Subir Grewal

subir.grewal@riskfocus.com

Lloyd Altman

lloyd.altman@riskfocus.com

Cary Dym

cary.dym@riskfocus.com

Peter Meulbroek

peter.meulbroek@riskfocus.com

Tara Ronel

tara.ronel@riskfocus.com

About Risk Focus

We created Risk Focus in 2004, but our technical and leadership experience goes back much further. Our clients lean on us for our deep domain knowledge, unmatched technology expertise and fine-tuned problem-solving and delivery methodologies.

We have deliberately avoided breakneck growth, instead hiring only proven industry experts and curious, thoughtful technologists who are motivated by the variety and scale of the challenges they conquer in our client projects.

Data Masking: A must for test environments on public cloud

Eat your own cooking

Why mask data? Earlier this month, the security firm Imperva announced it had suffered a significant data breach. Imperva had uploaded an unmasked customer DB to AWS for “test purposes”. Since it was a test environment, we can assume it was not monitored or controlled as rigorously as production might be. Compounding the error, an API key was stolen and used to export the contents of the DB.

In and of itself, such a release isn’t remarkable; it happens almost every day. What makes it unusual is that the victim was a security company, and one that sells a data masking solution: Imperva Data Masking. This entire painful episode could have been avoided if Imperva had used its own product and established a policy requiring all dev/test environments to be limited to masked data.

The lesson for the rest of us is that if you’re moving workloads to AWS or another public cloud, you need to mask data in all test/dev environments. In this blog post, we will consider how such a policy might be implemented.

Rationale for Data Masking

Customers concerned about the risk of data loss/theft seek to limit the attack surface area presented by critical data. A common approach is to limit sensitive data to “need to know” environments. This generally involves obfuscating data in non-production (development, test) environments. Data masking is the process of irreversibly, but self-consistently, transforming data such that the original value can no longer be recovered from the result. In this sense, it is distinct from reversible encryption and has less inherent risk if compromised.

As data-centric enterprises move to take advantage of public cloud, a common strategy is to move non-production environments first; the perception is that these environments present less risk. In addition, the nature of the development/test cycle means that these workstreams can strongly benefit from the flexibility in infrastructure provisioning and configuration that public cloud infrastructure provides. For this flexibility to have meaning, dev and test data sets need to be readily available, and as close to production as possible so as to represent the wide range of production use cases. Yet some customers are reluctant to place sensitive data in public cloud environments. The answer to this conundrum is to take production data, mask it, and move it to the public cloud. The perception of physical control over data continues to provide comfort (whether warranted or not). Data masking makes it easier for public cloud advocates to gain traction at risk-averse organizations by addressing concerns about the security of data in the cloud.

Additionally, regulations such as GDPR, GLBA, CAT, and HIPAA impose data protection standards that encourage some form of masking of Personal Data, PII (Personally Identifiable Information), and PHI (Personal Health Information) in non-production environments. Every organization in a covered industry has to meet these regulatory requirements.

Base Requirements

A masking solution ought to address the following base requirements:

  1. Data Profiling: the ability to identify sensitive data (e.g., PII or PHI) across data sources
  2. Data Masking: the ability to irreversibly transform sensitive data into non-sensitive data
  3. Audit/Governance Reporting: a dashboard for Information Security Officers responsible for meeting regulatory requirements and protecting data

Building such a feature set from scratch is a big lift for most organizations, and that’s before we begin considering the various masking functions that a diverse ecosystem will need.
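To make the data-profiling requirement concrete, here is a minimal sketch of the simplest possible approach: scanning rows against a handful of regular expressions. The column names, patterns, and sample records are illustrative assumptions of ours, not part of any product; real profiling engines combine pattern matching with dictionaries, metadata analysis, and sampling across every data source.

```python
import re

# Illustrative regex patterns for a few common PII shapes; a production
# profiler would use a far richer rule set plus dictionary and ML checks.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def profile_rows(rows):
    """Scan dict-like rows and report which columns appear to hold
    sensitive data, and which pattern matched."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return findings

sample = [
    {"name": "Jane Doe", "contact": "jane@example.com", "tax_id": "123-45-6789"},
    {"name": "John Roe", "contact": "555-867-5309", "tax_id": "987-65-4321"},
]
print(profile_rows(sample))
# e.g. {'contact': {'email', 'phone'}, 'tax_id': {'ssn'}}
```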

Masked data may have to satisfy referential integrity, human-readability, or uniqueness requirements to support distinct testing needs. Referential integrity is particularly important to clients who have several independent datastores performing a business function or transferring data between each other. Hash functions are deterministic and meet the referential integrity requirement, but do not meet the uniqueness or readability requirements. Several different masking algorithms may therefore be required depending on application requirements; the main options are listed below, followed by a brief illustration:

  1. Hash functions: e.g., replacing a value with its SHA-1 hash
  2. Redaction: truncating or substituting data in the field with random/arbitrary characters
  3. Substitution: replacing values with alternate “realistic” values (a common implementation samples real values to populate a hash table)
  4. Tokenization: substitution with a token that can be reversed, generally implemented by storing the original value along with the token in a secure location
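As a brief illustration of the first two approaches, the sketch below applies a keyed, deterministic hash to a join key (so rows in different datastores still line up after masking) and redacts a free-text field. The field names, salt, and truncation length are assumptions of ours for the example; a production implementation would add format preservation, type handling, and secure key management.

```python
import hashlib

def hash_mask(value: str, salt: str = "per-environment-secret") -> str:
    """Deterministic, irreversible mask: the same input always yields the
    same output, which preserves referential integrity across datastores."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def redact(value: str, keep: int = 0) -> str:
    """Replace field contents with arbitrary characters, optionally keeping
    a short prefix for readability."""
    return value[:keep] + "*" * (len(value) - keep)

customer = {"customer_id": "C-1001", "name": "Jane Doe", "ssn": "123-45-6789"}
trade = {"customer_id": "C-1001", "notional": 1_000_000}

masked_customer = {
    "customer_id": hash_mask(customer["customer_id"]),  # consistent join key
    "name": redact(customer["name"], keep=1),
    "ssn": redact(customer["ssn"]),
}
masked_trade = {
    "customer_id": hash_mask(trade["customer_id"]),  # same hash, rows still join
    "notional": trade["notional"],
}
assert masked_customer["customer_id"] == masked_trade["customer_id"]
```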

Data Masking at public cloud providers

AWS has published several white papers and reference implementations in this area. However, none of these solutions address masking in relational databases or integrate well with AWS Database Migration Service (DMS), the AWS relational database migration product.

Microsoft offers two variants of its SQL Server masking capability on Azure:

  • Dynamic Masking for SQL Server, which overwrites query results with masked/redacted data
  • Static Masking for SQL Server, which modifies the stored data to mask/redact it

For the purposes of this discussion, we focus on what Microsoft calls “static masking”, since “dynamic masking” leaves the unmasked data present in the database and therefore fails the requirement to shrink the attack surface as much as possible. We will also limit the discussion to AWS technologies in order to explore cloud-native versus vendor implementations.

Build your own data masking solution with AWS DMS and Glue

AWS Database Migration Service (DMS) currently provides a mechanism to migrate data from one data source to another, either as a one-time migration or via continuous replication, as described in the diagram below (from AWS documentation):

[Figure: AWS DMS replication from a source data store to a target data store (AWS documentation)]

DMS currently supports user-defined tasks that modify the Data Definition Language (DDL) during migration (e.g., dropping tables or columns). DMS also supports character-level substitutions on columns with string-type data. A data masking function using AWS Glue, the AWS ETL service, could be built to fit into this framework, operating on field-level data rather than on DDL or individual characters. An automated pipeline to provision and mask test data sets and environments using DMS, Glue, CodePipeline, and CloudFormation is sketched below:

[Figure: Automated provisioning and masking pipeline built on DMS, Glue, CodePipeline, and CloudFormation]

When using DMS and Glue, the replication/masking workload runs on AWS, not in the customer’s on-premises datacenter, so un-masked or un-redacted data briefly exists in AWS prior to transformation. This solution does not address security concerns around placing sensitive data (and the accompanying compute workloads) on AWS for clients who remain cautious about public cloud. Still, for firms seeking a cloud-native answer, the above can form the kernel of a workable solution when combined with additional work on identifying the data that needs masking and on reporting, dashboarding, and auditing.
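To give a sense of what the Glue masking step might look like, the sketch below reads a DMS-staged table from the Glue Data Catalog, hashes a few sensitive columns, and writes the masked copy to S3. It is a minimal illustration under our own assumptions: the catalog database, table, column names, salt, and S3 path are placeholders, and a real job would also handle schemas, data types, and incremental loads.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, concat, lit, sha2

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder catalog database/table populated by a DMS task.
staged = glue_context.create_dynamic_frame.from_catalog(
    database="dms_staging", table_name="customers"
)

# Deterministically hash the sensitive columns (illustrative column list).
df = staged.toDF()
for sensitive in ["ssn", "email", "phone"]:
    df = df.withColumn(
        sensitive,
        sha2(concat(lit("per-env-salt"), col(sensitive).cast("string")), 256),
    )

# Write the masked copy onward; the bucket/path is a placeholder.
masked = DynamicFrame.fromDF(df, glue_context, "masked_customers")
glue_context.write_dynamic_frame.from_options(
    frame=masked,
    connection_type="s3",
    connection_options={"path": "s3://example-masked-bucket/customers/"},
    format="parquet",
)
```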

Buy a solution from Data Masking Vendors

If the firm is less concerned about using cloud-native services, there are several commercial products that offer data masking in various forms and meet many of these requirements. These include IRI FieldShield, Oracle Data Masking, Okera Active Data Access Platform, IBM InfoSphere Optim Data Privacy, Protegrity, Informatica, SQL Server Data Masking, CA Test Data Manager, Compuware Test Data Privacy, Imperva Data Masking, Dataguise, and Delphix. Several of these vendors have some form of existing partnership with cloud service providers. In our view, the best masking solution for the use case under consideration is the one offered by Delphix.

Buy: Data Masking with Delphix

This option leverages one of the commercial data masking providers to build data masking capability at AWS. Delphix offers a masking solution on the AWS marketplace. One of the benefits of a vendor solution like Delphix is that it can be deployed on-premises as well as within the public cloud. This allows customers to run all masking workloads on-premises and ensure no unmasked data is ever present in AWS. Some AWS services can be run on-premises (such as Storage Gateway), but Glue, CodeCommit/CloudFormation cannot.

Database Virtualization

One of the reasons Delphix is appealing is the integration between its masking solution and its “database virtualization” products. Delphix virtualization lets users provision “virtual databases” by exposing a filesystem/storage layer to a database engine (e.g., Oracle) that contains a “virtual” copy of the files/objects constituting the database. It tracks changes at the file-system block level, offering a way to reduce the duplication of data across multiple virtual databases (by sharing common blocks). Delphix has also built a rich set of APIs to support CI/CD and self-service provisioning of databases.

Delphix’s virtualized databases offer several functions more commonly associated with modern version control systems such as git, including versioning, rollback, tagging, and low-cost branch creation coupled with the ability to revert to any point along the version tree. These functions are unique in that they bring source-control concepts to relational databases, vastly improving the ability of a CI/CD pipeline to work with them. This allows users to deliver on-demand, masked data to their on-demand, extensible public cloud environments.
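The pattern this enables in a CI/CD pipeline is roughly the following: provision a masked virtual database for a test run, exercise it, then tear it down or roll it back. The sketch below shows only that pattern; the endpoint paths, payloads, and credentials are hypothetical placeholders of ours and are not Delphix’s actual API, which should be consulted directly for a real integration.

```python
import os
import requests

# Hypothetical engine URL and endpoints -- placeholders, not Delphix's real API.
ENGINE = "https://masking-engine.example.internal"

session = requests.Session()
session.post(
    f"{ENGINE}/api/login",
    json={"user": "ci-bot", "password": os.environ["CI_BOT_PASSWORD"]},
)

# Provision a masked virtual database from a tagged snapshot for this pipeline run.
resp = session.post(
    f"{ENGINE}/api/virtual-databases",
    json={"source": "customers-masked", "snapshot": "release-42", "name": "ci-run-1234"},
)
vdb_id = resp.json()["id"]

# ... run the test suite against the provisioned copy ...

# Tear the virtual database down when the pipeline finishes.
session.delete(f"{ENGINE}/api/virtual-databases/{vdb_id}")
```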

A reference architecture for a chained Delphix implementation utilizing both virtualization and masking would look like this:

[Figure: Reference architecture for a chained Delphix implementation combining virtualization and masking]

Conclusion

For an organization with data of any value, masking data in lower environments (dev, test) is an absolute must. Masking such data also makes the task of migrating dev and test workloads to public clouds far easier and less risky. To do this efficiently, organizations should build an automated data masking pipeline to provision and mask data. This pipeline should support data in various forms, including files and relational databases. Should the build/buy decision tend toward purchase, there are several data masking products that can provide many of the core masking and profiling functions such a masking pipeline would need, and our experience has led us to choose Delphix.

Building Large Scale Splunk Environments? Automation and DevOps are a must.

Introduction

Splunk is a log aggregation, analysis, and automation platform used by small and large enterprises to provide visibility into computing operations and to serve as a Security Information and Event Management (SIEM) platform. It is a very mature product, with deep penetration in financial services firms. The Elasticsearch-Logstash-Kibana (ELK) stack is an open-source suite with a similar feature set.

For a large organization, log aggregation platforms can easily end up ingesting terabytes of data. Both Splunk and ELK require incoming data to be indexed and rely on distributed storage to provide scale. In practice, this means dozens, or even hundreds of individual nodes operating in a cluster. Once an organization moves past a 2-3 node log-aggregation cluster, a strong automation framework is essential for testing and deploying all components of the log aggregation platform. 

Risk Focus has built large analytics clusters for a number of clients in recent years, including clusters focused on financial and healthcare data analysis. Several engagements have involved building large Splunk clusters. The automation challenges for log aggregation platforms are similar in many respects to those of other analytics platforms. In almost all cases, the three major challenges are:

  • Managing data ingestion/archiving 
  • Configuring and scaling compute clusters required for analysis  
  • Maintaining consistent configuration across the entire cluster, along with robust testing

Infrastructure automation

Based on our experience, Splunk clusters containing more than a handful of nodes should not be built or configured manually. Automation delivers configuration consistency and prevents configuration drift. It also improves efficiency in resource deployment and utilization. This is true both for base system configuration and for cluster deployment and configuration tasks.

Organizations using current-generation technology can deploy standardized base operating system images to virtual machines, speeding up initial infrastructure deployment significantly. An automation and configuration management tool (such as Salt or Ansible) can then be used to deploy software and customized configuration onto each node within the network. Cloud orchestration technology (e.g., Terraform or AWS CloudFormation) may be used alongside configuration management tools in particularly large environments. As an aside, AWS has a managed Elasticsearch/ELK offering; for organizations considering cloud deployments, there is no need to re-invent the wheel, since the AWS offering provides virtually everything an organization might want in terms of infrastructure automation, including multi-AZ deployments for high availability.
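As an illustration of how little code the managed route requires, the sketch below uses boto3 to stand up a multi-AZ Amazon Elasticsearch Service domain. The domain name, region, instance types, counts, and storage sizing are placeholder assumptions for the example; in practice the same configuration would usually live in Terraform or CloudFormation so that it is version-controlled alongside the rest of the estate.

```python
import boto3

# Region and sizing below are illustrative assumptions -- tune to ingest volume.
es = boto3.client("es", region_name="us-east-1")

response = es.create_elasticsearch_domain(
    DomainName="log-aggregation-poc",
    ElasticsearchVersion="7.10",
    ElasticsearchClusterConfig={
        "InstanceType": "r5.large.elasticsearch",
        "InstanceCount": 6,
        "ZoneAwarenessEnabled": True,                        # multi-AZ for HA
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m5.large.elasticsearch",
        "DedicatedMasterCount": 3,
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 512},
)
print(response["DomainStatus"]["ARN"])
```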

Utilizing such automation frameworks makes it extremely easy to scale the environment up (and in certain cases down). It also simplifies other common management tasks critical for operational stability, including: 

  • Adding search heads or indexers: this is easier with a well-understood, automated deployment process that is not subject to manual error.
  • Disaster Recovery: easier and less costly to accommodate when compute resources can be stood up quickly and confidently with automated procedures.
  • Resource Efficiency: when search, ingestion, and reporting follow a specific pattern during the day, or different business units require additional capacity, infrastructure automation enables re-scaling components and redirecting resources towards other tasks/nodes.
  • Testing/Upgrading/Patching/Re-Configuring software: A consistent and modern DevOps practice is necessary to make modifications to a large cluster reliably and with minimal downtime. 
  • Security and Auditability: Financial services firms, health-care providers, and utilities face high regulatory burdens. Auditors and regulators have an interest in the computational operations of these clients. Splunk/ELK is a good source of data for audits and an indicator of mature operational management and monitoring practices within the information technology organization. As an organization begins to rely more on Splunk for both security monitoring and operational visibility, it can expect auditors and regulators to treat Splunk as critical infrastructure and take a greater interest in the Splunk environment itself. Employing automation and consistent DevOps practices to build, deploy and manage the Splunk environment goes a long way towards allaying regulatory concerns. But more importantly, to ensure operational stability, large Splunk clusters (or any other large-scale analytical/compute environment) should use automation for initial deployments and to manage configuration drift across the fleet. 

Splunk Administration Tools

Splunk has a rich toolset of GUI and CLI tools for managing configurations for indexers, forwarders, license managers, and other components in a Splunk cluster. Most firms are likely to have some form of automation/DevOps standard across the organization, in aspiration if not yet in full practice. A key part of planning for a large Splunk environment is defining where standardized, cross-platform configuration management tools end and where the Splunk toolset takes over.

There is no perfect answer to where this boundary lies. Risk Focus works with clients who have limited the use of Splunk management tools to managing licenses and performed virtually every other task with scripted automation frameworks. Other clients have chosen to go in the opposite direction and rely largely on Splunk’s tools to manage and configure the cluster.

Making the right decision on tooling depends on the practices and skill set of the team expected to support Splunk. Organizations that rely on external service providers (including Splunk PS) to maintain their environment need to consider whether these providers are familiar with their preferred automation or configuration management toolkits. Organizations that prioritize mobility among technology staff will want to place more emphasis on common tooling across all applications, rather than rely on Splunk-specific management tools.

Application Life Cycle

Testing and Quality Assurance 

An important and often underappreciated part of Splunk environment management is how to deal with the testing and release of software updates and patches, reports, dashboards, and applications. 

Organizations using Splunk for critical business activity should treat Splunk like any other business critical system and follow best practice for software delivery and maintenance. That means establishing processes to test software releases. A release can impact user activity, reports, dashboards, ingestion configurations, and system performance. Building Splunk test environments and efficiently rolling changes through a development/test/deployment lifecycle requires automation. Absent such automation, test cycles become expensive and will not be executed consistently. Mature organizations using a Continuous Integration/Continuous Deployment (CI/CD) lifecycle will find that the effort expended to integrate Splunk into their CI/CD pipeline delivers enormous rewards over time. 
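One way to wire Splunk into such a pipeline is to exercise critical saved searches and dashboards from an automated test. The sketch below is a minimal example using Splunk’s Python SDK (splunklib); the hostnames, credentials, and saved-search name are placeholder assumptions of ours. It dispatches a saved search against the test environment and fails the build if the search errors or returns nothing.

```python
import os
import time

import splunklib.client as client
import splunklib.results as results

# Connection details for the *test* Splunk environment -- placeholders.
service = client.connect(
    host=os.environ.get("SPLUNK_TEST_HOST", "splunk-test.example.internal"),
    port=8089,
    username=os.environ["SPLUNK_USER"],
    password=os.environ["SPLUNK_PASSWORD"],
)

def check_saved_search(name: str, min_results: int = 1) -> None:
    """Dispatch a critical saved search and fail loudly if it errors or
    returns fewer results than expected."""
    job = service.saved_searches[name].dispatch()
    while not job.is_done():
        time.sleep(2)
    rows = [r for r in results.ResultsReader(job.results()) if isinstance(r, dict)]
    assert len(rows) >= min_results, f"{name} returned only {len(rows)} results"

# Hypothetical dashboard-backing search validated on every release.
check_saved_search("ops_critical_error_rate")
```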

Release Management 

In managing releases and updates to Splunk configuration, it is useful to view Splunk in a broader context as a data analytics tool. Most organizations have some experience with data analytics tools such as SAP, SAS, Informatica, or Business Intelligence platforms. For these systems, organizations have often established fine-grained control over the reports and dashboards used in critical business activities, including limiting users’ ability to change them in production. Splunk is no different. We advise customers to clearly establish permission and ownership boundaries in their production Splunk environment, balanced so that they do not constrain Splunk users’ ability to analyze data.

One way to balance the tension between user freedom and organizational oversight is to create a dividing line between reports/dashboards used in daily operations and ad-hoc analyses. We find that the following best practices are quite useful: 

  • Critical reports and dashboards should be tightly controlled via Splunk’s permission tools 
  • Changes to critical dashboards, however minor, should be tested 
  • Software upgrades should involve user acceptance testing and automated testing of all such dashboards 
  • Data from test environments should be ingested continuously into a Splunk test environment. Application/infrastructure changes which might impact Splunk should be tested here.  
  • Critical monitoring dashboards should be validated as part of application testing to ensure any changes to software/log format do not impact these dashboards. 

Example

Splunk scales well to meet the needs of large enterprises, but the topology can get complex with increased scale. An example of a deployment framework for a multi-tenant Splunk cluster is shown below. Depending on your organization’s needs, automated or semi-automated scaling can be built into the framework.

Conclusion

The current generation of automation tools and DevOps best practices can deliver significant benefits to an organization seeking to manage and maintain a large Splunk cluster. Every organization should carefully consider the benefits of using such tools to manage their environment. Organizations relying on Splunk for critical operational management should treat it as such and build a robust testing framework for their environment.