Risk is typically expressed as a function of likelihood and severity or impact. Probability is an important factor but this article is about understanding the potential impact of change on a production software environment.
All changes carry an element of risk. Key to managing that risk is understanding and quantifying the risks inherent in a change. Not all changes present the same level of risk. A single character change in an HTML file or template is much lower risk than the same size of change in a configuration setting that affects the behavior of the entire application or service. Let's assume that each causes a problem. In the first case an error affects a single view. In the latter, the impact could be that the entire application crashes if the configuration setting is unexpected in production.
Just categorizing by the type of file is insufficient. If that one small HTML change breaks the login page, then it could have at least the same impact as the configuration file change.
By taking deliberate steps to quantify risks we can add additional measures to the development process in a more focused way.
Risks can be reduced. Building automated tests around the changed behavior reduces the risk that the change will not behave as expected. Adding inspection processes (code reviews) or pair programming helps reduce the risk of unexpected behavior, but an element of risk still remains.
Small changes are lower risk than large changes
Tested changes are lower risk than untested changes
Changes with more dependencies are higher risk than those with fewer
Continuous Delivery (CD) and Test Driven Development (TDD) are techniques that significantly reduce the first two issues. Having rapid recovery practices can, to some degree, reduce the impact of the third.
In a typical Continuous Delivery build and release pipeline each change is assessed through automated testing. Each change is added to a strong, known foundation because the previous version is of known quality: it passed its tests. With continuous deployment each incremental change is deployed into production, which reduces deployment risk.
Are we there yet?
For organizations working towards continuous delivery and deployment the risk of problems occurring in production is typically much higher during the transition because the testing just isn’t there yet.
Existing software was not developed with Continuous Delivery in mind. Automated test coverage might be low and provide little confidence. Releases often contain a large number of changes that have been tested as a complete set. As delivery cycles are shortened, more attention needs to be paid to each release. Given limited testing capacity, risk evaluations provide a valuable tool for focusing efforts.
Some code can be particularly difficult to change without causing problems. Some code is a bug magnet. In some cases it may not be worth going through a significant refactoring effort to make the code more maintainable. If a particular file or module has caused 9 of the last 10 problems, then the bug history suggests that another change to it is far more likely to cause a problem than a change elsewhere. We should be using bug data to predict the likelihood of a future problem and pay particular attention to testing that area.
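As a rough illustration of using bug data this way, the sketch below computes each file's share of past defects. It assumes the defect tracker can export one file name per resolved defect; the function name, the file names and the 50% threshold are all hypothetical.

```python
from collections import Counter


def defect_share(defect_files):
    """Return each file's share of past defects as a fraction between 0 and 1.

    defect_files holds one entry per resolved defect, naming the file that
    was fixed (an assumed export format, not a real tracker API).
    """
    counts = Counter(defect_files)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {path: count / total for path, count in counts.items()}


# Hypothetical history: 9 of the last 10 defects were fixed in billing.py.
history = ["billing.py"] * 9 + ["login.html"]
shares = defect_share(history)

# Flag files responsible for at least half of the recorded defects.
hot_spots = [path for path, share in shares.items() if share >= 0.5]
```

A share like this describes where past problems came from. It is only a proxy for the likelihood that the next change causes a problem, so it is best used to rank areas for extra testing rather than as a probability.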
Quantifying risk input sources
Assessment of impact by affected source type (HTML/JS/Java/SQL/Config)
Assessment of dependencies - how many other areas depend on the change area
Defect resolution data for areas that have proven problematic in the past.
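The dependency input in the list above could be approximated by counting how many areas depend, directly or indirectly, on the area being changed. This is a minimal sketch that assumes the dependency graph is already available as a dictionary; the area names are made up.

```python
from collections import deque


def count_dependents(depends_on, changed):
    """Count the areas that directly or transitively depend on `changed`.

    depends_on maps each area to the list of areas it depends on.
    """
    # Invert the graph so we can ask "who depends on this area?".
    dependents = {}
    for area, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(area)

    # Breadth-first walk over the inverted graph, starting at the change.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for area in dependents.get(current, ()):
            if area not in seen and area != changed:
                seen.add(area)
                queue.append(area)
    return len(seen)


# Hypothetical dependency graph for illustration only.
graph = {
    "checkout": ["cart", "auth"],
    "cart": ["catalog"],
    "profile": ["auth"],
    "catalog": [],
    "auth": [],
}
```

With this graph, a change to "catalog" reaches "cart" and, through it, "checkout", while a change to "checkout" reaches nothing else.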
Implementing a risk assessment
Let's first assume that the potential impact of a production change is the sum of the potential impacts of the files changed as part of that change. To quantify the risk we first need to know which files were changed, and then we need an assessment of the potential impact of each file change so we can sum them into an overall risk assessment.
One of the most complex parts of implementing a risk assessment tool is knowing which changes to include.
We all know that software systems need to be versioned but that versioning needs to be traceable to the original version control revision and branch. Only then do we know the files changed between that revision and the revision last deployed in the environment.
Environment changes often involve multiple components. Decomposing systems into smaller parts using techniques and designs like microservices helps to contain impacts to a single source repository. This significantly improves our ability to assess deployment risk.
Monolith or microservice, from a risk point of view the problem is the same. Compile a list of files that have changed since the version deployed in that environment. Assign a risk value to each based on the file type or role, and then sum up the values for all those changes.
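The summing step could be sketched as follows. The weights are invented for illustration and would need to be agreed by the team; the sketch also assumes that the file suffix is a good enough stand-in for the file type.

```python
from pathlib import PurePosixPath

# Hypothetical impact weights per file type; higher means more potential impact.
IMPACT_BY_SUFFIX = {
    ".html": 1,
    ".js": 2,
    ".java": 3,
    ".sql": 5,
    ".properties": 8,
}
DEFAULT_IMPACT = 2  # assumed weight for file types not listed above


def change_risk(changed_files):
    """Sum the potential impact of every file in the change."""
    return sum(
        IMPACT_BY_SUFFIX.get(PurePosixPath(path).suffix.lower(), DEFAULT_IMPACT)
        for path in changed_files
    )


# One HTML file (1), one configuration file (8), one unlisted type (2).
total = change_risk(["web/index.html", "conf/app.properties", "README"])
```

Here the three files sum to 11. A plain sum means a large change made up of many low impact files can outscore a single high impact file, which matches the observation that large changes carry more risk than small ones.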
Reverse tracing changes
Some teams use the revision ID as the build number or as the least significant part of the version number. In git that means using the commit SHA (short or long), but what about components or services built from multiple repositories? If a deployment involves multiple dependent system changes, it is common to define the deployment using a release manifest: a list of components and their versions. Armed with that list, and with versions that can be traced back to version control, we can identify all the files that have changed.
Another approach is to add the versions to the asset itself, either as a file containing all the contributing revision numbers and repositories or as some form of metadata in properties. Let's assume that a deployable component is built from multiple source code repositories. The dependencies could be binary, as libraries, or through source, as sub-modules.
Yet another approach is to add the risk assessment to the component. This last one seems attractive but relies on the production (last known good) version remaining stable. If an interim release is made then the assessment is out of date (although arguably it errs on the side of pessimism).
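To make the manifest idea concrete, the sketch below compares the manifest last deployed to an environment with the manifest of a candidate release and reports which components need their changes traced. The manifest shape (component name mapped to revision) and all names and revisions are assumptions.

```python
def changed_components(deployed, candidate):
    """Return {component: (old_revision, new_revision)} for every component
    in the candidate manifest whose revision differs from the deployed one.

    A component that is new in the candidate has None as its old revision.
    Components dropped from the candidate are not reported by this sketch.
    """
    changes = {}
    for name, new_revision in candidate.items():
        old_revision = deployed.get(name)
        if old_revision != new_revision:
            changes[name] = (old_revision, new_revision)
    return changes


# Hypothetical manifests mapping component name to commit SHA.
deployed = {"web": "3f2a9c1", "billing": "a81d0e4"}
candidate = {"web": "3f2a9c1", "billing": "c77b2f0", "search": "19e4d5a"}

to_trace = changed_components(deployed, candidate)
```

Each pair of revisions in the result is the input for a diff in the corresponding repository.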
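One way to picture a version file carried inside the asset is a small JSON document written at build time and read back at deployment time. The field names below are hypothetical and not taken from any particular build tool.

```python
import json


def build_info(component, version, sources):
    """Serialize the contributing repositories and revisions as JSON text.

    sources is a list of {"repository": ..., "revision": ...} dictionaries.
    """
    return json.dumps(
        {"component": component, "version": version, "sources": sources},
        indent=2,
        sort_keys=True,
    )


def source_revisions(build_info_text):
    """Read the JSON back and return {repository: revision}."""
    info = json.loads(build_info_text)
    return {src["repository"]: src["revision"] for src in info["sources"]}


# Hypothetical component built from two repositories.
text = build_info(
    "billing",
    "2.4.1",
    [
        {"repository": "billing-service", "revision": "c77b2f0"},
        {"repository": "shared-lib", "revision": "5d10e9b"},
    ],
)
revisions = source_revisions(text)
```

Because the file travels with the asset, the revisions deployed in an environment can be recovered from the environment itself rather than from build server history.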
Once the two revisions are known, a simple diff generates a list of changes. For git, adding the --name-only flag to git diff produces a list of files that have changed between the two revisions.
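A minimal sketch of collecting that list from a script, assuming git is installed and the repository is checked out locally. The helper names are hypothetical; the git options used are -C (run in the given directory) and diff --name-only.

```python
import subprocess


def diff_command(repo_path, old_revision, new_revision):
    """Build the git command that lists files changed between two revisions."""
    return ["git", "-C", repo_path, "diff", "--name-only",
            old_revision, new_revision]


def parse_name_only(output):
    """Turn the command output (one path per line) into a list of paths."""
    return [line for line in output.splitlines() if line.strip()]


def changed_files(repo_path, old_revision, new_revision):
    """Run git and return the changed paths; raises if git reports an error."""
    result = subprocess.run(
        diff_command(repo_path, old_revision, new_revision),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_name_only(result.stdout)
```

Calling changed_files with the path to a checkout, the revision deployed in the environment and the candidate revision returns the paths to be scored.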
From the list we can apply recognition patterns for files that are considered higher risk. The patterns can be generated by agreement within the team on areas that are particularly sensitive to change, or from past experience and bug reports.
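Recognition patterns could be expressed as an ordered list of regular expressions, each with a risk value, where the first match wins. The patterns and values below are placeholders for whatever the team agrees on. Note how a role based pattern (the login page) is listed before the generic pattern for its file type.

```python
import re

# Hypothetical patterns, most specific first. The first match decides the risk.
RISK_PATTERNS = [
    (re.compile(r"(^|/)login[^/]*\.html$"), 8),   # login views
    (re.compile(r"\.(properties|ya?ml)$"), 8),    # configuration files
    (re.compile(r"\.sql$"), 5),                   # database changes
    (re.compile(r"\.html$"), 1),                  # other views and templates
]
DEFAULT_RISK = 2  # assumed value for files no pattern recognizes


def file_risk(path):
    """Return the risk value of the first pattern that matches the path."""
    for pattern, risk in RISK_PATTERNS:
        if pattern.search(path):
            return risk
    return DEFAULT_RISK


def flag_high_risk(paths, threshold=5):
    """Return the paths whose risk value meets or exceeds the threshold."""
    return [path for path in paths if file_risk(path) >= threshold]
```

Ordering the patterns this way lets a small HTML change to the login page score as high as a configuration change, while other HTML files keep a low value.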