By actively monitoring your ArcGIS Enterprise organization, you can stabilize system uptime, identify service performance issues or outages, and proactively adjust the resources allocated across participating machines to run the underlying applications. Monitoring solutions can provide active checks for commonly used endpoints and alert the appropriate contacts when responses fall outside of expected tolerances. In addition, you can use them to collect historical information that can be correlated with system and software logs during root cause analysis or postmortem investigations.
While you can use ArcGIS Monitor to monitor your ArcGIS Enterprise organization, there are also third-party tools that allow you to achieve similar results. The information below is a starting point for how to integrate monitoring solutions with ArcGIS Enterprise.
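As a rough illustration of an active check, the following Python sketch requests an endpoint once and reports whether it answered successfully within a tolerance. The function name `check_endpoint`, the `opener` parameter, and the example URL are assumptions made for this sketch; which endpoints to check and what tolerance to apply depend on your deployment and your monitoring tool.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, max_seconds, opener=urllib.request.urlopen):
    """Request url once and report whether it answered within max_seconds.

    opener is a hypothetical hook so the check can be exercised without a
    network; by default the standard urllib opener is used.
    """
    started = time.monotonic()
    try:
        # Allow twice the tolerance so a slow answer is still measured,
        # while a hung request cannot block the check indefinitely.
        with opener(url, timeout=max_seconds * 2) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # the server answered, but with an error status
    except OSError:
        status = None       # no answer at all (DNS failure, refused, timed out)
    elapsed = time.monotonic() - started
    healthy = status == 200 and elapsed <= max_seconds
    return {"url": url, "status": status, "seconds": elapsed, "healthy": healthy}
```

A scheduler could call `check_endpoint("https://gis.example.com/portal/", 5)` (a placeholder URL) every few minutes and notify the appropriate contacts whenever `healthy` is false.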
Monitor metrics
In general, there are two perspectives from which enterprise applications can be monitored: resource utilization and user experience.
Resource utilization is a familiar concept to those in systems administration in that it involves characteristics of the collection of machines and supporting infrastructure that run the enterprise software. These metrics typically scale proportionally with the volume of users accessing the platform, but some workflows may cause significant spikes in utilization as well.
Alternatively, user experience monitoring generally reflects how the client connects and interacts with front-end applications and is more familiar to business analysts and GIS administrators. These metrics are useful in determining baseline response times for a variety of requests, which can then be used to establish thresholds at which administrative teams should be alerted. There are also aspects of the user experience that require consideration outside of response times, such as SSL certificate expiration.
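The two ideas in the preceding paragraph can be sketched in Python. The first function derives an alert threshold from baseline response times using the mean plus three sample standard deviations, which is one common heuristic and an assumption here, not a recommendation from this documentation. The second converts a certificate `notAfter` string, in the format that Python's `ssl` module reports for a peer certificate, into days remaining. Both function names are hypothetical.

```python
import ssl
import statistics


def alert_threshold(baseline_seconds, k=3.0):
    """Return mean + k sample standard deviations of the baseline timings.

    k=3.0 is an assumed default; tune it to how noisy your baseline is.
    """
    mean = statistics.mean(baseline_seconds)
    spread = statistics.stdev(baseline_seconds)
    return mean + k * spread


def days_until_expiry(not_after, now_epoch):
    """Days between now_epoch (seconds since the epoch, UTC) and the
    certificate expiration given as a 'notAfter' string."""
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / 86400
```

For example, baseline timings of 0.4, 0.5, and 0.6 seconds give a threshold of about 0.8 seconds. An expiry alert could fire when `days_until_expiry` falls below a lead time that suits your certificate renewal process.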
The subsections below describe monitoring a system from a resource utilization perspective.
Resource utilization
When monitoring machines in an ArcGIS Enterprise deployment from a resource utilization perspective, metrics to track include the following:
- Processor—When a participating machine's processor spikes or reaches 100 percent capacity, compute requests are backlogged, which may cause a delay in the return of information. This applies to any running process when experiencing a burst in activity.
- Physical memory—When physical memory approaches 100 percent utilization, running processes may crash as they attempt to expand into additional memory space. This is mitigated by the presence of virtual memory.
- Virtual memory—Virtual memory provides a buffer between the physical memory of a machine and the underlying storage. It uses part of the underlying storage to exchange data out of physical memory while keeping it more readily accessible than loading directly from disk. Adverse effects due to virtual memory exhaustion are common with Windows systems. Virtual memory expansion (page file growth) begins to occur at 90 percent physical memory utilization. In some configurations, virtual memory can be exhausted before physical memory begins reaching 100 percent.
- Committed memory—System committed memory capacity is the sum of the physical memory of a machine plus the virtual memory size at a given point in time. Since virtual memory can grow, the committed memory limit can change over time. A machine approaching 100 percent committed memory utilization indicates that both physical and virtual memory are being exhausted and more resources are needed.
- Disk volume available space—Running out of disk space on the system, application, or data volumes can have significant consequences for both the running operating system and any applications that depend on those volumes. Monitor available space to ensure that systems do not run out of disk space and to detect significant increases in used space, which can indicate anomalous publishing events.
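To make the committed memory and disk space metrics concrete, the following Python sketch computes each as a percentage. It follows the definition above, in which the commit limit is physical memory plus the current virtual memory size. The function names are hypothetical, and the memory figures would come from whatever collection tool you use.

```python
import shutil


def committed_percent(committed, physical, page_file):
    """Committed memory in use as a percentage of the commit limit.

    All three arguments must use the same unit (for example, bytes or GB).
    """
    # The limit is not fixed: it grows whenever the page file grows.
    commit_limit = physical + page_file
    return 100.0 * committed / commit_limit


def free_space_percent(path):
    """Percentage of free space on the volume that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total
```

For example, a machine with 16 GB of physical memory, an 8 GB page file, and 18 GB committed is at 75 percent committed memory utilization, even though the committed total already exceeds physical memory.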
As you monitor your system, keep in mind that network bottlenecks, though becoming rarer in enterprise-grade network environments, can affect the optimal response times for ArcGIS Enterprise components. This becomes increasingly possible in a multimachine environment where multiple internal requests are exchanged between all ArcGIS Enterprise components and other registered data sources and file services.
When possible, divide the processor and memory into a per-process listing to determine which process is spiking during a given time. When using this level of granularity in monitoring, the command line portion of the process can be used to distinguish ArcGIS Enterprise internal components from each other or from real-time antivirus scanning, for example.
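A per-process breakdown of this kind might be summarized as in the following Python sketch. It assumes your monitoring tool can export samples that include each process's command line and processor percentage. The sample structure, the function names, and the path fragments in `COMPONENT_PATTERNS` are all hypothetical; replace the fragments with text that actually appears in the command lines on your machines.

```python
from collections import defaultdict

# Hypothetical command line fragments (lowercase) used to label processes.
COMPONENT_PATTERNS = {
    "ArcGIS Server": "arcgis\\server",
    "Portal for ArcGIS": "arcgis\\portal",
    "ArcGIS Data Store": "arcgis\\datastore",
}


def label_process(command_line):
    """Return the first component whose fragment occurs in the command line."""
    lowered = command_line.lower()
    for label, fragment in COMPONENT_PATTERNS.items():
        if fragment in lowered:
            return label
    return "other"


def cpu_by_component(samples):
    """Sum processor percentage per label, highest consumer first.

    samples is a list of dicts with 'command_line' and 'cpu_percent' keys.
    """
    totals = defaultdict(float)
    for sample in samples:
        totals[label_process(sample["command_line"])] += sample["cpu_percent"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Grouping by label rather than by executable name is a deliberate choice in this sketch: several processes can share one executable name, and the command line is what tells them apart.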
Monitor not only the machines on which ArcGIS Enterprise components are installed but also any file servers and database instances that the deployment depends on for proper operation. ArcGIS Enterprise applications typically start at their lowest resource consumption levels. As applications are accessed and used, their resource consumption scales proportionately with that usage.
Collect resource metrics
You can use Windows Performance Monitor to collect system resource utilization data. This collection tool can capture various metrics; the following example adds the metrics listed above to a data collector set. You can run data collector sets on remote computers, so a central monitoring machine can collect metrics from multiple machines running ArcGIS Enterprise software.
Note:
While this is one example of collecting metrics, any monitoring software can be configured similarly to capture resource utilization metrics. See the software's documentation for additional information.
To set up a data collector set, do the following:
- Click Start > Windows System > Control Panel.
- Choose System and Security and click Administrative Tools.
- Click Performance Monitor.
- Expand Data Collector Sets and right-click User Defined.
- Choose New > Data Collector Set.
- Create a data collector set:
- Type a name for the data collector set.
- Choose Create manually (Advanced).
- Click Next.
- Check the Performance counter check box under Create data logs, and click Next.
- Click Add to log performance counters.
- Add a performance counter to collect data on processor utilization:
- Browse to Processor in the list of available counters.
- Expand Processor and choose % Processor time.
- Choose _Total under Instances of selected object.
- Click Add. The counter appears under Added counters.
- Add performance counters to track the remaining resource utilization metrics:
- Browse to and expand Logical Disk and click % Free Space, then choose <All instances> of the selected object. Click Add.
- Browse to and expand Network Interface and click Bytes Total/sec, then choose <All instances> of the selected object. Click Add.
- Browse to and expand Memory and click % Committed Bytes in Use. Click Add.
- Click Memory > Available MBytes. Click Add.
- Click Memory > Pages/sec. Click Add.
- Click OK.
- Change the Sample interval value to 5 and change the Units option to Minutes.
You can increase or decrease this value depending on the preferred resolution of logging. Typically, when an issue is occurring, the sample interval is shortened to, for example, 15 seconds, while during normal operation, an interval of 15 or 30 minutes may be adequate.
- Click Finish.
- Right-click the created data collector set under User Defined and click Start.
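The counters selected in the steps above can also be sampled from a script. The following Python sketch builds a command line for `typeperf`, a separate Windows command line tool rather than the data collector set itself, and is offered only as an assumption about how you might automate collection. The function names, the output file name, and the remote computer name are hypothetical.

```python
import subprocess

# Counter paths corresponding to the counters added in the steps above.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\LogicalDisk(*)\% Free Space",
    r"\Network Interface(*)\Bytes Total/sec",
    r"\Memory\% Committed Bytes In Use",
    r"\Memory\Available MBytes",
    r"\Memory\Pages/sec",
]


def build_typeperf_command(counters, interval_seconds, sample_count,
                           output_csv, computer=None):
    """Return the typeperf argument list; nothing is executed here."""
    command = ["typeperf", *counters,
               "-si", str(interval_seconds),   # sample interval in seconds
               "-sc", str(sample_count),       # number of samples to collect
               "-f", "CSV", "-o", output_csv,  # write a CSV log file
               "-y"]                           # answer yes to prompts
    if computer:
        command += ["-s", computer]            # sample a remote computer
    return command


def collect(counters, interval_seconds, sample_count, output_csv, computer=None):
    """Run typeperf (Windows only) and wait until all samples are written."""
    command = build_typeperf_command(counters, interval_seconds,
                                     sample_count, output_csv, computer)
    subprocess.run(command, check=True)
```

For example, `collect(COUNTERS, 300, 12, "metrics.csv")` would take one sample every 5 minutes for an hour, matching the sample interval used in the steps above.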
Note:
When a data collector set is running, you cannot see a real-time report. Stopping and starting the data collector set generates a report for the time between when the last report was created and the current time.
Analyze resource metrics
Once you have chosen a collection tool and captured resource utilization data for your machines, you can analyze resource metrics. Consider the following when analyzing resource metrics:
- The lifespan of the issue—Understanding whether the occurrence was an isolated event or a long-term trend will help you determine the best path forward in most situations. A short-term spike in resource utilization tends to accompany sudden demand for specific services, such as when a new dashboard or web app is released or a department is added to the portal. Longer-term growth toward the current utilization can indicate increasing popularity of the platform and its associated services or applications. Short-term spikes may or may not recur, so the context surrounding those events is important in determining whether additional resources are needed to increase the long-term stability of the deployment.
- The processes consuming most of the system's resources—From a Portal for ArcGIS and ArcGIS Data Store perspective, utilization should scale almost linearly with the number of users on the platform and use of hosted services, respectively. When considering ArcGIS Server, scaling of dedicated services and use of hosted services are the two major factors in resource utilization. Dedicated services can be tuned in an ArcGIS Server site to reduce overall resource utilization, but that may not be adequate when demand reaches its peak over time.
- The distribution of roles—Distributing roles across multiple machines in an ArcGIS Enterprise deployment allows for a more careful resource adjustment for each component as well as increased granularity of understanding when issues arise. Increasing resources for only the relational data store or hosting server machines may be more strategic than increasing resources for a single-machine based enterprise deployment. You can make adjustments to the current site architecture through join site operations to move from a single machine to a distributed architecture in an established deployment.
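The distinction between an isolated spike and longer-term growth can be approximated from collected samples. The following Python sketch compares the median of the older half of a series with the median of the newer half, then checks whether the peak stands far above the typical value. The function name and both ratio thresholds are assumptions chosen for illustration, and the labels are a starting point for investigation rather than a diagnosis.

```python
import statistics


def classify_utilization(samples, spike_ratio=1.5, growth_ratio=1.2):
    """Label a series of utilization percentages, ordered oldest first.

    Expects at least two samples. Medians are used so that a single
    spike does not masquerade as growth.
    """
    half = len(samples) // 2
    early = statistics.median(samples[:half])
    late = statistics.median(samples[half:])
    if early > 0 and late / early >= growth_ratio:
        return "sustained growth"
    typical = statistics.median(samples)
    if typical > 0 and max(samples) / typical >= spike_ratio:
        return "short-term spike"
    return "steady"
```

For instance, a processor series that sits near 30 percent with one sample at 90 percent is labeled a short-term spike, while a series that climbs steadily from 30 to 55 percent is labeled sustained growth.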
Resolve issues
Now that you can identify, track, and analyze machine resource metrics, you can address unexpected system responses. This may mean increasing assigned processor resources, assigning or installing more RAM, or increasing disk space. Before taking action, you must understand the best practices for resolving resource utilization issues.
Processor utilization
Before increasing the assigned processor resources of the machines encountering high processor utilization, determine whether an ArcGIS Enterprise component or other software on the system is causing the spikes in utilization. Security software with real-time scanning enabled can increase processor utilization during normal web server and database operations. If this is the case, report the observed behavior to your cybersecurity team. For virtual machines, the underlying host may be overprovisioned, which can lead to a performance bottleneck that cannot be detected from within the virtual machines.
Physical memory utilization
When physical memory utilization approaches 100 percent, the machines may require more RAM assigned or installed. As described above, separating workloads on dedicated machines can allow for more granular resource allocation and reduce current resource contention, but you can also increase memory on the existing machines. When physical memory utilization approaches 100 percent, the available virtual memory may be exhausted as well.
Virtual and committed memory utilization
Virtual and committed memory utilization typically demonstrate the same patterns when reaching 100 percent utilization. Virtual memory allows for processes to use more memory than is available on a system and typically scales automatically to a threshold value unless set statically by the system administrator responsible for the provisioned machines. You may be able to increase virtual memory by modifying system settings if there is adequate disk space to extend the page file.
Disk volume available space
Disk space exhaustion is one of the most unpredictable failure modes that can occur in an ArcGIS Enterprise deployment. Files can be blanked or truncated when attempts to update them are incomplete, which can prevent the software from starting properly. First, search for large files that can be moved to a registered data store or other location. If you cannot move or remove enough files, you must increase the disk space. You can also migrate the system directories to separate storage, such as the content directory for a Portal for ArcGIS site or the cache directory for ArcGIS Server.
Note:
To see the top 25 files by size in the current directory and its subdirectories, run this command in an administrative PowerShell window:
Get-ChildItem -Recurse -File | Sort-Object -Property Length -Descending | Select-Object -First 25 Name, @{Name="Size (GB)";Expression={[Math]::Round($_.Length / 1GB, 2)}}
Running this command on the root volume can take a long time, so it is recommended that you browse to a particular directory before running it.
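Where PowerShell is not available, a comparable search can be sketched in Python. This is an assumed equivalent rather than a supported tool: the function name `largest_files` is hypothetical, and sizes are returned in bytes instead of rounded gigabytes.

```python
import os


def largest_files(root, count=25):
    """Return (size_bytes, path) pairs for the largest files under root."""
    sizes = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # skip files that vanish or cannot be read
    # Largest first; ties are ordered by path.
    return sorted(sizes, reverse=True)[:count]
```

As with the PowerShell command, start from a specific directory rather than the root of a volume to keep the run time reasonable.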