One thing I have learned consulting at companies is that with the proper building blocks (update/original) in place, it is possible to avoid outages, improve end-user experience and motivate developers.
Operations tools such as monitoring and analytics are most often used to notice problems with infrastructure and applications, but with proper usage and communication they can add value to an organization in many other ways.
Under pressure to release a new feature, developers rush to migrate an application to a new framework that has not been run in production before. After careful testing and close work with the Ops team, the new version is released on Friday by the Ops team with a Dev team member present. Everything seems to run smoothly, and the team heads off to enjoy a much-deserved weekend.
Because Ops and Dev worked closely to develop monitoring procedures before the release, the Ops team notices that application servers are occasionally hanging and setting off alarms in monitoring (Icinga). The alarms clear quickly because the automated process manager (Monit) kicks in to restart the application before end users can perceive the issue.
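The check-and-restart behavior described above can be illustrated with a minimal sketch. This is not how Monit is implemented or configured. The `Watchdog` class, its `check` and `restart` callbacks, and the `max_failures` threshold are all hypothetical names chosen for illustration, and the sketch assumes that a restart is triggered only after several consecutive failed health checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Restart a service after `max_failures` consecutive failed health checks."""

    check: Callable[[], bool]    # hypothetical probe: True when the application responds
    restart: Callable[[], None]  # hypothetical action: restarts the application
    max_failures: int = 3        # assumed threshold, not taken from the story
    failures: int = 0            # consecutive failures seen so far

    def poll(self) -> bool:
        """Run one check cycle and return True if a restart was triggered."""
        if self.check():
            # A healthy response resets the failure streak.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False
```

In practice `poll` would be called on a fixed interval. Requiring several consecutive failures keeps a single slow response from triggering a restart, while a hung server is still recovered within a few cycles.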
Since the entire infrastructure is under full configuration management control (Puppet), there is no back and forth between groups about what has changed in the infrastructure. The system manifests are all under source control management (Git), so infrastructure changes are easily visible. In addition, the Ops team has a policy of no changes on Fridays. Because of these factors, the infrastructure can quickly be ruled out as the culprit and attention can turn to the application layer.
The Ops team has been given the resources to collect data on all facets of the environment, including OS-level performance statistics (Graphite/Collectd). From the graphs it is easy to see that the number of connections to the application is growing steadily and that connections are not being released. This is easily confirmed by checking a single application server that shows a high connection count in the graph.
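The same pattern that is visible by eye in the graphs can also be checked programmatically. The sketch below assumes the connection counts have already been exported from the metrics store as a plain list of evenly spaced samples. The function names and both thresholds are hypothetical and would need tuning against real data.

```python
from typing import Sequence


def growth_per_sample(counts: Sequence[float]) -> float:
    """Least-squares slope of the counts against their sample index."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(counts))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(
    counts: Sequence[float],
    min_slope: float = 1.0,            # assumed: at least one extra connection per sample
    min_rising_fraction: float = 0.8,  # assumed: at least 80% of steps do not drop
) -> bool:
    """True when the counts trend upward and rarely fall between samples."""
    slope = growth_per_sample(counts)  # also validates the sample count
    steps = [b - a for a, b in zip(counts, counts[1:])]
    rising = sum(1 for s in steps if s >= 0) / len(steps)
    return slope >= min_slope and rising >= min_rising_fraction
```

A series such as `[10, 14, 19, 23, 30, 34, 41]` is flagged, while one that fluctuates around a stable level, such as `[20, 18, 22, 19, 21, 20, 19]`, is not. Combining the slope with the fraction of non-decreasing steps separates connections that are never released from ordinary traffic spikes.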