Maintenance, Operational Concerns, and Cost

So much of our time as software developers is spent maintaining existing applications and services. I’ve always taken this as a given and treated it as an annoyance as opposed to “real work”. I don’t want to maintain someone else’s old and crappy code! I want to build new things and “do them the right way”. Looking at code that’s as old as the company and trying to make sense of it all is depressing and a total time sink.

Over the past few years, my views on software maintenance have changed pretty dramatically. I’ve realized that successful applications in an “Enterprise” environment spend far more time in maintenance mode than they do being actively developed. Consumers of your application don’t care that you used some cutting-edge framework or state-of-the-art architectural patterns. They just care that it works and continues to work well.

Sure, these new techniques can be great and I will never discourage developers from trying to optimize the application development process, but operational standards also need to be taken into consideration. Have you given any thought to how your application will look 6 months from now? 1 year from now? 5 years from now? Years after you and your entire team have left the company?

It’s a pretty intimidating task and one far too many choose to simply ignore. Software maintenance is an expensive and time-consuming process. Many organizations spend non-trivial amounts of time and energy trying to figure out how to reduce overall maintenance costs. Having to ask these kinds of questions is a pretty solid sign of both technical and process debt. Sounds like it’s time to make a large payment with interest!

Where Do I Even Start?

Simple, really. Define your operational concerns as everyone’s concerns. Just because an application is deployed doesn’t mean developers are off the hook for everything but stacktraces. We must put to rest thought processes like this; they are counterproductive and hurt everyone.

Stacktrace or get

Yes, comics like this are amusing, but conversations like this happen in more organizations than we’d like to admit. Have you ever done this? I know I have (sorry, previous team members).

Operational Concerns are Everyone’s Concerns

I’ve seen questions like this asked plenty of times.

The application has been deployed and is currently serving requests. Isn’t the rest an ops problem?

Answer: NO!!!

The job of an operations team is to maintain the infrastructure that allows us to deploy our applications and run our business: deployment, networking, server health, OS upgrades, etc. While some cross-training should be expected, operations staff should not be expected to understand every fine-grained detail of how every service in our platform works. Operational concerns are just as much our responsibility as theirs.

Before you consider your application “feature complete” and “live”, can you answer the following questions?

  • Does the application have a defined and agreed upon SLA? Who is responsible for ensuring the SLA is being met? Are you being proactive or reactive?
  • What is the expected turnaround time for security patches or library updates? When these issues are reported, are they high priority for the responsible teams?
  • Who is on the on-call rotation? Is there enough documentation for them to effectively address problems despite having no involvement in the application’s development? If they don’t answer within the SLA, how does it escalate? What if the next person doesn’t answer either?
  • Does the application have an easy-to-find dashboard? What verifies that it is correct? Is it based on logs, statsd, both, something else?
  • Will teams receive an alert if there is an unexpected slowdown, a drop in expected connections, or a spike in CPU or memory usage?
  • Can someone look at logs and understand what is happening with no knowledge of the codebase? What kinds of things are being logged?
  • Are logs being written to standard locations in a consistent format?
  • Does this application have a playbook for dealing with production issues? Who is responsible for keeping it up to date?
  • Can someone find the playbook easily?
  • Can someone find the codebase easily?
  • Can someone find additional documentation easily?
  • Can someone deploy hotfixes without tracking down a gatekeeper at 3AM on a Sunday?
  • What if we need to scale up the application? Has that been tested?
  • What happens if every cache server (varnish, memcached, etc.) goes offline for an hour?
  • What happens if the entire team is in the mountains with no internet access and the application goes down?
  • In the event of an outage or degradation, how long will it take before we can expect a post-mortem? Are post-mortems published and stored in standard locations? Who is responsible for ensuring these take place?

Isn’t that quite the list? After a few years of application development and maintenance, I have had to ask teams every single one of those questions, and the quality of the answers has varied widely. Yes, that list is a little extreme, but remember that computers are precision machines and even the slightest glitch can bring our beautiful technological wonder to its knees. We have to be ready to handle any kind of emergency. As developers, a great way to ensure this is to thoughtfully design and build our applications to account for long-term operations.
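To make the logging questions from the checklist a bit more concrete, here is a minimal sketch of what “a consistent format” could look like using only Python’s standard library. The field names (ts, level, service, event, request_id, duration_ms) and the helper names JsonFormatter and build_logger are my own assumptions for illustration, not a standard. The idea is simply that every service emits one JSON object per line with the same keys, so whoever is on call can read or search the logs without knowing the codebase.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            # UTC timestamp so logs from different hosts line up.
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # "service" is a hypothetical field passed in through `extra`.
            "service": getattr(record, "service", "unknown"),
            "event": record.getMessage(),
        }
        # Copy optional context fields only if the caller supplied them.
        for key in ("request_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("checkout", buffer)
    log.warning(
        "payment gateway timeout",
        extra={"service": "checkout", "request_id": "abc123", "duration_ms": 5000},
    )
    print(buffer.getvalue(), end="")
```

In a real deployment the stream would be stdout or a file in whatever location your organization has standardized on. The point is not this particular format but that the choice is made once and shared by every application.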

The Straight Talk Express

If answering and addressing all these questions is “overkill” for an application, then we need to ask ourselves some pretty straightforward questions. Answers here should be easy. There is no right or wrong answer.

  • Does the business really depend on this application?
  • Does your team care if it goes offline (getting yelled at by management doesn’t count)?
  • If the application goes offline who will care?
  • Would it make more sense for teams that care to own this application?

After answering all of these questions, you should have a pretty clear idea about the direction and future of your application. Perhaps it is time to transfer ownership, or perhaps it is time for this application to be decommissioned.

Ownership shouldn’t be a matter of pride either. We must always do what is best for our platform. Sometimes teams move in different directions and applications have to be shuffled around.

The End?

Repeat after me.

Software is not done until it is decommissioned.

Every moment your application is online and fulfilling requests it is not finished. It is still alive and its heart is still beating. As long as this remains true we must account for ongoing operational and maintenance costs.

Think about these things ahead of time and I guarantee costs will be reduced. Your work environment will be just a little bit more sunshine and rainbows and you can continue working on your new and exciting projects.

Special thanks to Colin Dean and Zack Zlotnik for reviewing and proofreading this post.

cross-posted from Jon Daniel’s Blog