Getting Under The Covers: An Introduction to Salesforce Event Monitoring & Splunk
As part of a multi-tenant cloud platform, your Salesforce org can be a black box, offering limited insight into what's happening behind the scenes:
- Why is this page taking so long to load since our last deployment?
- How long does this trigger take to run in Production?
- Which processes are putting us at risk of hitting platform or governor limits?
- Why is a user trying to log in 60,000 times in an hour? (Yes, this really happened!)
These are all relevant questions, with not-so-simple answers.
In Part One of this two-part blog series, we will walk through a real-world case study where we performed a deep dive into an enterprise-scale Salesforce org using Salesforce Event Monitoring and Splunk. In this engagement, we were tasked with identifying and solving the most pressing issues for an org that was experiencing some growing pains due to rapid increases in users, data throughput, and tech debt.
The Client & Where We Started
For an enterprise tech company that had experienced rapid growth, a poor-performing Salesforce org with inconsistent data was an untenable reality. Stopping the confusion was a priority, but where do we start? Tactical analysis with debug logs, browser dev tools, or good old-fashioned user feedback could have been an option, but not necessarily the best one, because with those methods it's hard to quantify the depth and breadth of the problem:
- Users can tell us that they’re seeing errors “all the time,” but what does that mean?
- What is the specific error or errors they are experiencing?
- How frequently has it been happening?
- Where did it originate?
- Are all users having the same issue, or is it isolated to this specific user?
We answered these questions, and Event Monitoring was our secret weapon.
Event Logs: A Firehose of Information
With the Event Monitoring license, Salesforce publishes detailed log files for the events in your org every hour and makes them available for download. This is a subset of the extensive logging that Salesforce captures internally for its own use.
These logs cover data such as login activity, Apex executions, trigger executions, Lightning interactions, errors and exceptions, Visualforce page views, API calls, and much more. Knowing that our client was already using Splunk for their non-Salesforce applications, we decided to put it to use for their Salesforce event logs as well.
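To make the mechanics concrete, here is a minimal sketch of pulling those hourly files over the standard REST API. The instance URL, access token, and API version are placeholders you would supply from your own org; the EventLogFile object, its LogFile blob endpoint, and the Interval = 'Hourly' filter are part of the documented Event Monitoring API.

```python
# Minimal sketch: download hourly Event Monitoring logs via the REST API.
# INSTANCE_URL, ACCESS_TOKEN, and the API version are placeholders --
# supply real values from your own org and OAuth flow.
import csv
import io

import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"  # placeholder
ACCESS_TOKEN = "<oauth-access-token>"                    # placeholder
API = f"{INSTANCE_URL}/services/data/v59.0"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# EventLogFile records point at the downloadable log files.
# Interval = 'Hourly' assumes hourly event logs are enabled for the org.
soql = (
    "SELECT Id, EventType, LogDate FROM EventLogFile "
    "WHERE EventType = 'ApexUnexpectedException' AND Interval = 'Hourly' "
    "ORDER BY LogDate DESC LIMIT 24"
)
resp = requests.get(f"{API}/query", headers=HEADERS, params={"q": soql})
resp.raise_for_status()

all_rows = []
for rec in resp.json()["records"]:
    # The LogFile field is a blob endpoint that returns the raw CSV.
    blob = requests.get(
        f"{API}/sobjects/EventLogFile/{rec['Id']}/LogFile", headers=HEADERS
    )
    blob.raise_for_status()
    all_rows.extend(csv.DictReader(io.StringIO(blob.text)))

print(f"Parsed {len(all_rows)} ApexUnexpectedException events")
```

In our engagement, the Splunk Add-on for Salesforce handled this ingestion automatically; the snippet just shows roughly what it does under the hood.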
Ingesting these event log files into Splunk allowed us to spot anomalies that were previously hidden; we could finally answer, with specificity, what was going on inside the black box. All of this had been invisible to us before, and Splunk's intuitive UI and visualization options made our findings easy for business users to understand.
The Splunk App for Salesforce and the Splunk Add-on for Salesforce, both available for free, let you start analyzing your Salesforce org right away, with dozens of out-of-the-box features such as pre-made reports, dashboards, lookups, and more.
We used these OOB features as a springboard for creating our own, more client-specific content, studying the queries Salesforce had written as inspiration and guidance for our own. For example, we needed to understand the extent and causes of ongoing row lock errors within the org, so we created a dashboard to surface this information.
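We built the real thing as a Splunk dashboard, but the underlying aggregation is easy to sketch. Continuing from the download snippet above (so `all_rows` holds parsed ApexUnexpectedException events), and assuming row lock failures surface as UNABLE_TO_LOCK_ROW in the exception message (worth verifying against your own logs), a "row locks per hour, by origin" panel boils down to:

```python
# Sketch of the aggregation behind a "row locks per hour / by origin" panel.
# TIMESTAMP_DERIVED, EXCEPTION_MESSAGE, and STACK_TRACE follow the documented
# EventLogFile schema; the UNABLE_TO_LOCK_ROW match is an assumption.
from collections import Counter

row_locks = [
    r for r in all_rows
    if "UNABLE_TO_LOCK_ROW" in r.get("EXCEPTION_MESSAGE", "")
]

# TIMESTAMP_DERIVED is ISO 8601, so the first 13 chars give an hourly bucket.
per_hour = Counter(r["TIMESTAMP_DERIVED"][:13] for r in row_locks)

# The top stack-trace line is a rough proxy for where the lock originated.
per_origin = Counter(
    r["STACK_TRACE"].splitlines()[0] if r.get("STACK_TRACE") else "(none)"
    for r in row_locks
)

for origin, count in per_origin.most_common(10):
    print(f"{count:6d}  {origin}")
```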
Salesforce Event Monitoring gave us insight not only into the specific processes that were bogging down the org, but also into the sheer scope of the problem. Here's a quick look at the state of the org and some of the most prominent error processes when we started:
- ~1,200 "unable to lock row" errors per day
- ~100 governor limit errors per day
- Many pages with >10-second load times
- ~120 null pointer exceptions per day
- ~5 million transactions per week
Instead of simply knowing that we were getting a lot of row locks, code exceptions, and governor limit exceptions, we needed to find out exactly where they were coming from.
Event Monitoring gave us visibility into which errors were most prominent and which users were most affected. We could see the class, method, flow, trigger, or line of code where an error process originated; the timing of error spikes; the transaction type (Future Method, Queueable, Bulk API, Synchronous, Trigger, Lightning, etc.); and more.
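As a sketch of that drill-down, here is the same idea over the parsed ApexUnexpectedException rows from the earlier snippet. EXCEPTION_TYPE, STACK_TRACE, and USER_ID are documented fields on that event type; treating the first stack frame as the "origin" is our own heuristic.

```python
# Sketch: rank errors by type and origin, then count who is affected.
# all_rows comes from the download snippet earlier in this post.
from collections import Counter, defaultdict

def origin(row):
    # First stack frame ~ the class/method/line where the error was thrown.
    trace = row.get("STACK_TRACE", "")
    return trace.splitlines()[0] if trace else "(no stack trace)"

by_error = Counter((r.get("EXCEPTION_TYPE", "?"), origin(r)) for r in all_rows)

users_hit = defaultdict(set)
for r in all_rows:
    users_hit[(r.get("EXCEPTION_TYPE", "?"), origin(r))].add(r.get("USER_ID"))

for (exc_type, where), count in by_error.most_common(10):
    print(f"{count:6d}  {len(users_hit[(exc_type, where)]):4d} users  "
          f"{exc_type}  @ {where}")
```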
Once we found the most pressing issues, we created numerous dashboards and reports to track these processes. Dashboards gave our client’s business users reusable resources to track the health of the org based on every available metric.
Actionable Insights
Knowing how many errors we're getting and plotting them in graphs can be helpful, but if we leave it at that, we've collected fun facts, not fixes. It's time to take the data from event logs and turn it into actionable insights. Without steps to reproduce or specific details, how can we create tickets that our developers can action quickly and accurately? Typically, a ticket would consist of an anecdote from a user saying they received a vague error message while trying to do something. Where does the developer start?
The insights surfaced by Event Monitoring let us enrich our tickets with concrete, specific data: frequency of impact, the specific users affected, stack traces, and more.
This reduced the "discovery" phase of every ticket for the development team and made them much more effective when reaching out to specific end users for clarification. The wealth of information already compiled accelerated the path for developers to begin researching solutions: the class, method, line of code, user(s) affected, page URL, timestamp, total error count, and a full history of exactly when the problem started were all there.
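As a sketch, here's how that ticket summary can be assembled from the same parsed events. The exception type below is just an example filter; field names follow the documented ApexUnexpectedException schema.

```python
# Sketch: turn raw events into the summary we pasted into each ticket.
# "System.NullPointerException" is only an example filter; all_rows comes
# from the earlier download snippet.
target = "System.NullPointerException"
hits = [r for r in all_rows if r.get("EXCEPTION_TYPE") == target]

if hits:
    times = sorted(r["TIMESTAMP_DERIVED"] for r in hits)
    users = {r.get("USER_ID") for r in hits}
    print(f"Error:       {target}")
    print(f"Occurrences: {len(hits)}")
    print(f"First seen:  {times[0]}   Last seen: {times[-1]}")
    print(f"Users hit:   {len(users)}")
    print("Sample stack trace:")
    print(hits[0].get("STACK_TRACE", "(none)"))
```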
Validating The Fixes
Not only could we see everything in the current state, but Event Monitoring also allowed us to measure the impact of our changes post-deployment. In other words: did what we changed actually fix the issue? Is it really having the impact we expected?
Perhaps we expect Ticket 1 to decrease the run time of the Contact trigger, Ticket 2 to eliminate the row lock errors hit by TaskTrigger, or Ticket 3 to reduce Bulk API usage by 20%. Within minutes, we can find out whether the changes worked and quantify the impact of a production deployment.
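Here's a sketch of the Ticket 1 check: pull ApexExecution events using the same download pattern shown earlier (with EventType = 'ApexExecution'), then compare the trigger's average run time before and after the deploy. ENTRY_POINT and RUN_TIME are documented ApexExecution fields; the handler name and deployment timestamp below are hypothetical examples.

```python
# Sketch: did the deployment actually speed up the Contact trigger?
# Fill exec_rows via the earlier download pattern, using
# EventType = 'ApexExecution'. The constants below are hypothetical.
from statistics import mean

exec_rows: list[dict] = []                   # parsed ApexExecution CSV rows

DEPLOYED_AT = "2023-06-01T00:00:00"          # hypothetical deploy timestamp
ENTRY = "ContactTriggerHandler"              # hypothetical trigger handler

runs = [r for r in exec_rows if ENTRY in r.get("ENTRY_POINT", "")]
before = [int(r["RUN_TIME"]) for r in runs
          if r.get("RUN_TIME") and r["TIMESTAMP_DERIVED"] < DEPLOYED_AT]
after = [int(r["RUN_TIME"]) for r in runs
         if r.get("RUN_TIME") and r["TIMESTAMP_DERIVED"] >= DEPLOYED_AT]

if before and after:
    print(f"Avg run time before: {mean(before):8.1f} ms ({len(before)} runs)")
    print(f"Avg run time after:  {mean(after):8.1f} ms ({len(after)} runs)")
    print(f"Change:              {mean(after) - mean(before):+8.1f} ms")
```

In practice we ran this comparison as a Splunk search so the before/after view updated continuously, but the arithmetic is the same.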
Event Monitoring gave us visuals that quantify for everyone (even the most non-technical person) precisely how much org performance has improved since our last deployment, over the last quarter, or over the last year. Explaining to executives why the work we're doing matters becomes simple with such clear visualizations.
We can now prove that the work we're doing directly correlates with users spending less time waiting for Salesforce and more time closing deals. Every closed ticket had a corresponding graph, chart, or table proving its worth.
Next Steps
What is lurking in the shadows of your Salesforce org? How much insight could Event Monitoring give you? Reach out to us and we can help your organization embark on its own Event Monitoring journey. Don't forget to check out Trailhead to learn more as well.
In Part Two of this series, we'll look at how to take Event Monitoring to the next level and proactively identify problems before users have time to complain.