A good friend of mine, a senior Oracle DBA, recently looked at the automated database health check report we provide our customers and made an interesting observation. He said, “out of the 40 or so metrics you guys look at, at least 10 are outdated!” When I asked him to expound further, he responded “well, your report is still looking at cache hit ratios and such. The prevailing sentiment in the DBA community (especially the Oracle DBA community) is that hit ratios are useless. What truly matters are the wait stats and trace output.”
Now, is that really the case? I asked about a dozen of my other DBA friends what they thought and, unfortunately, got a mixed message. Hours of Google searches showed a disappointing trend: the term “health check” is somewhat abused, and no two health check utilities out there are alike. Every DBA group that uses one seems to kinda make one up on the fly. I wish there were a standard health check that one could rely on.
Anyway, my limited research makes me wonder – why isn’t there an easily available standard health check? And more importantly, what really constitutes an “ideal” health check? Here are my two cents on the topic:
An ideal health check should provide an overview of a database’s stability across three major areas for consideration:
- Availability
- Performance
- Scalability
Now, one could argue that all three of the above areas really boil down to just one: performance. However, from a practical standpoint, I would like to treat these three areas as distinct. For instance, in my mind, the well-being of a hot standby maintained via a Data Guard physical configuration is required for ensuring high availability, not necessarily for performance; whereas ensuring that sufficient buffer caches have been configured helps facilitate optimal performance more so than availability. Also, in certain cases, performance- and availability-related actions may be at odds with one another. For instance, one may configure a redo-log switch every 10 minutes (one way to do this is sketched below) to ensure the primary database and hot standby are within 10-15 minutes of each other (helps availability), whereas the same action may result in higher I/O and negatively impact performance (especially if the system is already I/O saturated).
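As a concrete illustration of that trade-off, Oracle lets you cap the time between redo-log switches via the ARCHIVE_LAG_TARGET initialization parameter (value in seconds); the 10-minute figure below simply mirrors the example above and is not a recommendation.

    -- Force a redo-log switch at least every 10 minutes (600 seconds) so the
    -- standby never falls more than ~10 minutes behind; the extra switches
    -- are the I/O cost mentioned above.
    ALTER SYSTEM SET archive_lag_target = 600 SCOPE = BOTH;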
So, going back to the “ideal health check”, one should ensure that there are no existing or upcoming issues that are impeding, or are likely to impede, any of these three areas. Below is a “top 40” list of things that would help ensure that is indeed the case. Comments are placed in-line, next to the statistic name/type, in parentheses.
Category: Availability
1. Database space (Should consider both database space and OS space; would be ideal to also consider growth trends. A sample free-space query appears after this list.)
2. Archive log space
3. Dump area space (bdump, cdump, adump, udump, etc.)
4. Archive logs being generated as per SLA (Note: This requires the health check utility to have an understanding of the SLA pertaining to standby database upkeep.)
5. Archive logs being propagated as per SLA
6. Snapshot/Materialized view status
7. Status of DBMS Jobs
8. Replication collisions
9. Backup status
10. Online redo logs multiplexed (On different mount-points.)
11. Control file multiplexed (On different mount-points.)
12. Misc errors (potential bugs) in alert.log
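To make item 1 a bit more concrete, here is a rough sketch of a free-space check against the data dictionary. The 10% threshold is an arbitrary assumption, it ignores autoextensible datafiles and temp files, and a real health check would also fold in OS-level space and growth trends as noted above.

    -- Flag tablespaces with less than 10% free space (threshold is an assumption).
    SELECT d.tablespace_name,
           ROUND(d.total_mb, 1)                           AS total_mb,
           ROUND(NVL(f.free_mb, 0), 1)                    AS free_mb,
           ROUND(100 * NVL(f.free_mb, 0) / d.total_mb, 1) AS pct_free
    FROM   (SELECT tablespace_name, SUM(bytes)/1024/1024 AS total_mb
            FROM   dba_data_files
            GROUP  BY tablespace_name) d,
           (SELECT tablespace_name, SUM(bytes)/1024/1024 AS free_mb
            FROM   dba_free_space
            GROUP  BY tablespace_name) f
    WHERE  d.tablespace_name = f.tablespace_name (+)
    AND    100 * NVL(f.free_mb, 0) / d.total_mb < 10
    ORDER  BY pct_free;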
Category: Performance
13. Disparate segment types (tables, indexes, etc.) in same tablespace
14. SYSTEM tablespace being assigned as a user's default or temporary tablespace
15. Temporary tablespace not being a true temp tablespace
16. Deadlock related errors in alert.log
17. Non-symmetric segments or non-uniformly sized extents in a tablespace (For dictionary-managed tablespaces.)
18. Invalid objects
19. Any event that incurred a wait over X seconds (“X” to be defined by the user during health check report execution. Default value could be 5 seconds. Obviously, for this value to be available, some kind of stats recording mechanism needs to be in place. In our case, Data Palette is used to collect these stats, so the health check report can query the Data Palette repository for wait events and corresponding durations.)
20. Hit ratios: DB buffer cache, redo log, SQL area, dictionary/row cache, etc. (Opinions are mixed on whether these are useful, but I like to include them for the DBAs who do rely on them to identify whether any memory shortage exists in the database instance and to adjust the related resource(s) accordingly. While I do have an opinion on the matter, my goal is not to argue whether this stat is useful or not; it is to provide it to the people who need it, and there are quite a few folks that still value hit ratios. A sample hit-ratio query appears after this list.)
21. I/O / disk busy (I/O stats, at the OS and database levels.)
22. CPU load average or queue size
23. RAM usage
24. Swap space usage
25. Network bandwidth usage (Input errors, output errors, queue size, collisions, etc.)
26. Multi-threaded settings (Servers, dispatchers, circuits, etc.)
27. RAC related statistics (False pings, cache fusion and interconnect traffic, etc. – based on the Oracle version.)
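For item 20, the classic buffer-cache hit ratio can be derived from v$sysstat as shown below; this is only one of several common formulations, and whether you act on it is, as discussed, a matter of opinion.

    -- Buffer cache hit ratio: 1 - (physical reads / logical reads), as a percentage.
    -- Note: v$sysstat values are cumulative since instance startup, so sampling
    -- deltas over an interval is more meaningful than a single snapshot.
    SELECT ROUND(100 * (1 - phy.value / (db.value + con.value)), 2) AS buffer_cache_hit_pct
    FROM   v$sysstat phy, v$sysstat db, v$sysstat con
    WHERE  phy.name = 'physical reads'
    AND    db.name  = 'db block gets'
    AND    con.name = 'consistent gets';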
Category: Scalability (Note: To ensure there are no scalability-related issues, the mechanism generating the health check should ideally be able to look at current resource consumption trends and apply predictive algorithms to discern whether there will be contention or a shortfall. In the absence of such predictive capabilities, a basic health check routine can still use thresholds to determine whether a resource is close to being depleted.)
28. Sessions (A resource-limit query covering items 28 and 29 appears after this list.)
29. Processes
30. Multi-threaded resources (dispatchers, servers, circuits, etc.)
31. Disk Space
32. Memory structures (locks, latches, semaphores, etc.)
33. I/O
34. CPU
35. RAM
36. Swap space
37. Network bandwidth
38. RAC related statistics (False pings, cache fusion and interconnect traffic, etc. – based on the Oracle version.)
39. Understanding system resources consumed by non-DB processes running on the same server/domain (3rd party applications such as ETL jobs, webservers, app servers, etc.)
40. Understanding system resources consumed by DB-related processes running outside their normal scheduled window (Applications such as backup processes, archive log propagation, monitoring (OEM) agents, etc. This requires the health check utility to know which processes are related to the database and their normal execution time/frequency.)
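For items 28 and 29 (and a few of the other scalability items), v$resource_limit already tracks current and high-water utilization against the configured limits, so a simple threshold check is possible even without predictive trending; the query below is a minimal sketch.

    -- Compare session/process usage against their configured limits.
    SELECT resource_name,
           current_utilization,
           max_utilization,
           limit_value
    FROM   v$resource_limit
    WHERE  resource_name IN ('sessions', 'processes');

Keep in mind that limit_value is reported as a string (it can be 'UNLIMITED'), so a scoring routine needs to handle that case before computing a percentage-used figure.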
The health check utility can show each of the above items in red, yellow or green depending on whether the statistic needs urgent attention (red), needs review (yellow) or is healthy (green). Accordingly, the health of the database can be numerically quantified based on where each statistic lands (a statistic in green status earns 1 point, a yellow earns 0.5 and a red earns 0).
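Here is a minimal sketch of that scoring scheme, assuming the health check writes each statistic's status to a (hypothetical) HEALTHCHECK_RESULTS table with a STATUS column holding 'GREEN', 'YELLOW' or 'RED'.

    -- Overall health score: green = 1, yellow = 0.5, red = 0, expressed as a percentage.
    SELECT ROUND(100 * SUM(CASE status
                             WHEN 'GREEN'  THEN 1
                             WHEN 'YELLOW' THEN 0.5
                             ELSE 0
                           END) / COUNT(*), 1) AS health_score_pct
    FROM   healthcheck_results;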
The above list can be expanded to accommodate additional configurations such as a custom standby setup or database audit requirements, and statistics pertaining to more complex environments such as RAC and Advanced Replication. Even entire categories can be added as appropriate. For example, “Security” would be a good category to add, with statistics pertaining to user-related information such as account lockouts and password change status, tablespace quotas, audit usage, virtual private database configuration, and so on.
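As a possible starting point for such a Security category, a couple of the account-related checks can come straight from dba_users; the 7-day password-expiry window below is an arbitrary assumption.

    -- Accounts that are not open, or whose passwords expire within the next 7 days.
    SELECT username, account_status, lock_date, expiry_date
    FROM   dba_users
    WHERE  account_status <> 'OPEN'
       OR  expiry_date < SYSDATE + 7;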
Database-level and instance-level health checks need to be shown separately, regardless of whether RAC is in use. Statistics such as database space belong at the database level, whereas statistics such as memory settings and hit ratios belong in the instance-level health check report.
Lastly, it would be nice if the health check could include a recommendation on how to resolve a statistic that’s showing up in yellow or red, or even better, offer a link to an automated standard operating procedure that would help fix the situation.
I mean this to be an open list, so any thoughts from the community on further augmenting (or better yet, standardizing) it would be appreciated.