Rebuilding with trust

Groups

In my last blog, I showed what happens when you are hit by bad news. In many cases it is hard to re-use the old environment while maintaining trust, as it is often unclear how deep the infection went.

When it is time to rebuild, the first step is to create a framework for trust, to ensure the people rebuilding are indeed the people you want to be rebuilding, instead of some adversary happy to help out by re-creating a backdoor. To do this, you need to set up a framework of groups, roles and rights that is relatively static and can be re-created automatically.

In my blog post on Enterprise IT I presented a model to define the groups you can identify:


For each of the dark blue blocks, you have a Domain Architect. The Domain Architect is part of the enterprise architecture team and makes sure all developments in an architecture domain align with the developments in other domains.

Within each domain, you have one or more Platform Designers, such as a Network Architect, a VMware Architect or a Storage Architect in the Datacenter domain. The main responsibility of the Platform Designer is to lead all development efforts and make sure all developments in the domain are aligned with the Domain Architect and the other Platform Designers. As such, you could view the Domain Architect as the “architecture lead” for the technology domain and the Platform Designers as the engineering team that makes it happen. Together they author the High Level Design for the platform, explaining not only the solutions to be engineered, but also the process that was followed and the decisions that were made to arrive at those solutions.

This role is purposely separated from the team lead or HR role. You want your Platform Designers to be able to firmly disagree with the Domain Architect without repercussions for their careers. In organizations that do not make this separation, you often see innovation stagnate…

The Platform Designers cannot do this alone. The technical knowledge required quickly goes too deep, so a specialist is needed to translate the engineering guidelines into designs and implementations. It is important that the Platform Designer and the Platform Specialist form a close-knit team, where the Designer focuses more on the structure of the platform and the Platform Specialist more on the operational aspects. When there are issues, the Specialist is probably the first to hear of them, but the Platform Designer should want to be a close second. As a team, they are responsible for the Low Level Designs for the platforms.

Operations is done by Operators. Together with the Platform Specialist, they make sure everything runs smoothly.

You may wonder why changes are not mentioned. There is a very simple reason for this: changes are not an operational task and should not be performed manually. They should be automated, as this is the only way to trust that the changes executed have been fully tested before execution. This separation between RUN and CHANGE is one of the major drivers for regaining trust.

This leads to a structure like the one below:

As you can see, the structure is easy to understand and easy to implement. It can be nested several levels deep when the complexity of the underlying platform requires this.

The picture above is implemented in Keycloak in the CICD-Toolbox. This is a playground where I test implementations of the stuff I blog about. I chose Keycloak because of its extensive documentation, the fact that it is open source, and the possibility to get support from Red Hat when needed.

Roles

The roles that need to be implemented depend on the application that is to be put under IAM. In order to get started, let’s take a look at some applications you’ll likely find in a CICD pipeline that will be used to re-deploy your infra:

  • git (source code/script/workflow repository, the workhorse for the developer)
  • Jenkins (orchestrator, does most of the boring work)
  • Sonatype Nexus (curated source of updates and report store)

The roles defined should be application specific, so they are only announced to the client requesting access on behalf of a user. This prevents rogue clients from silently learning all available roles.

git

Git is your repository where all the workflows and other scripts are stored. As can be expected, not everyone has access to each repository and thus every repository in git is subject to IAM.

In general, rights are read, write and admin. As you want to be able to set this on a per-repository basis, this implies that there needs to be a role for each repository right. So for the NetCICD repository, this boils down to:

  • git-netcicd-read
  • git-netcicd-write
  • git-netcicd-admin

You can go further in defining roles, but in general this will suffice.

The reason for the prefix git- in the role name is to be able to distinguish between roles with identical names for different applications.

These roles need to be defined in git and associated with the proper repository, NetCICD in this case. There may be far more complex group mappings, teams, organizations and other constructs in your git tool, but keeping it simple also keeps your IAM system manageable. A sketch of how these roles could be created is shown below.
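Since Keycloak is the IAM system in the toolbox, these repository rights map naturally onto Keycloak client roles, which are only announced to the git client itself. Below is a minimal sketch of how this could be scripted with Keycloak's kcadm.sh admin CLI; the realm name, client name and server URL are illustrative assumptions, not the toolbox's actual configuration.

```bash
#!/usr/bin/env bash
# Sketch: define the git repository rights as Keycloak client roles.
# Assumptions: a realm "cicd" and a client "git" exist; adjust to taste.
KCADM=/opt/keycloak/bin/kcadm.sh

# Authenticate against the admin REST API (prompts for the password)
$KCADM config credentials --server https://keycloak.example.com \
  --realm master --user admin

# Look up the internal id of the "git" client
CLIENT=$($KCADM get clients -r cicd -q clientId=git \
  --fields id --format csv --noquotes)

# One client role per repository right
for right in read write admin; do
  $KCADM create "clients/${CLIENT}/roles" -r cicd \
    -s "name=git-netcicd-${right}" \
    -s "description=NetCICD repository: ${right}"
done
```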

Jenkins

Jenkins has quite a well-defined system for defining local roles:

This implies you can define roles locally. Luckily, all kinds of pattern matching are possible. When a user comes to the system through IAM, it provides the role(s) assigned to the user, which become properties of the user in Jenkins.

These roles can easily be linked to rights:

By linking roles to specific rights, the user can execute what is required and permitted.
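To keep this mapping as code rather than clicks, the Role Strategy plugin can be configured through the Configuration as Code (JCasC) plugin mentioned later in this post. The fragment below is a hedged sketch of what such a mapping could look like; the role names, group names and job pattern are illustrative, not the NetCICD configuration.

```bash
# Sketch: drop a JCasC fragment that maps roles to rights in Jenkins.
# Assumes the Role Strategy and Configuration as Code plugins are
# installed; role names, groups and the job pattern are illustrative.
cat > /var/jenkins_home/casc_configs/roles.yaml <<'EOF'
jenkins:
  authorizationStrategy:
    roleBased:
      roles:
        global:
          - name: "reader"
            permissions:
              - "Overall/Read"
            entries:
              - group: "jenkins-read"
        items:
          - name: "netcicd-dev"
            pattern: "netcicd.*"   # pattern matching on job names
            permissions:
              - "Job/Read"
              - "Job/Build"
            entries:
              - group: "jenkins-netcicd-dev"
EOF
```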

Understanding and properly locking down Jenkins can be tricky though, especially with remote agents like those used in NetCICD, where a Jenkinsfile defines what Jenkins is going to do.

Sonatype Nexus

Just like Jenkins, Nexus lets you define local roles:

As you can see, I have defined a Jenkinsagent role for a Jenkins agent running in some remote lab, so that it can push test reports to Nexus. I provided this role with the right to push data to the NetCICD reports repository:
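Nexus roles can also be created as code through its REST API. The call below is a sketch assuming a Nexus 3 instance with a raw netcicd-reports repository; the role id, privilege name and credentials are illustrative.

```bash
# Sketch: create the Jenkins agent role in Nexus 3 via the REST API.
# The privilege follows Nexus' nx-repository-view-<format>-<repo>-<action>
# naming; repository name and credentials are illustrative.
curl -u admin:changeme -X POST \
  "https://nexus.example.com/service/rest/v1/security/roles" \
  -H "Content-Type: application/json" \
  -d '{
        "id": "nx-jenkinsagent",
        "name": "Jenkinsagent",
        "description": "Remote Jenkins agent: push test reports",
        "privileges": ["nx-repository-view-raw-netcicd-reports-add"],
        "roles": []
      }'
```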

As you can see, these applications make it easy to work with single sign-on (provided you configure it correctly).

Configuration of the systems

In order to work safely, you need to get the software for these systems from the respective vendors. To rebuild quickly, I would use Docker containers. Why? Simple: the vendor maintains a proper version of the software, including the latest security patches, on Docker Hub. When you get the software from there, chances are small that it is tainted.
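To reduce the chance of pulling a tainted image even further, you can pin images to immutable digests and, where the vendor signs its images, enable Docker Content Trust. A minimal sketch, with illustrative image names and a placeholder digest:

```bash
# Refuse unsigned images (only works where the publisher actually signs)
export DOCKER_CONTENT_TRUST=1
docker pull jenkins/jenkins:lts

# Or pin to an exact digest you recorded earlier, instead of a mutable tag
docker pull sonatype/nexus3@sha256:<digest-recorded-earlier>
```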

Configuration must be done as code: after all, you want to know what is configured while you are rebuilding. That is why the configuration in the CICD toolbox uses shell scripts and the vendor-provided CLI to configure all systems where possible. Where that is impossible, I used a configuration-as-code plugin, for example for Jenkins. Such a plugin is published on the vendor site, with install counts and badges that help you judge whether the code can be trusted.

Even though it is tricky, we'll have to keep our fingers crossed at this point, as we do not yet have anything to validate against. If you cannot live with this, download the Docker images, plugins and code beforehand, scan and validate them, and run from removable media when required.
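A sketch of that offline route, assuming a USB drive mounted at /media/usb; paths and image names are illustrative:

```bash
# On a trusted, connected machine: fetch, export and fingerprint the image
docker pull jenkins/jenkins:lts
docker save jenkins/jenkins:lts -o /media/usb/jenkins-lts.tar
sha256sum /media/usb/jenkins-lts.tar > /media/usb/jenkins-lts.tar.sha256

# Later, on the rebuild host: verify the checksum before loading
sha256sum -c /media/usb/jenkins-lts.tar.sha256 && \
  docker load -i /media/usb/jenkins-lts.tar
```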

Next steps

In the next blog, we'll dive into more foundational stuff. Before we can configure Keycloak to make the magic happen, we need to make sure we can trust Keycloak as a source and that no eavesdropping can occur. After that, we'll go into detail on OIDC and JSON Web Tokens: how they are used and their role in transferring information from user to application.

Now what?

So, there is this company that does everything OK. They have these nice revolving doors at the entrance that weigh you when you get in and out, they have this massive security guard, you have to wear your badge visibly at all times, and all the other things you are reminded of daily when you get to work. They even use tokens with a PIN to log on, in combination with the badge: something most companies do not even have.

They also have these nice red books, blue books and white books with all processes and procedures written out, and a lot of people validating that these are followed to the letter. Everything is audited continually, and they are even considering introducing the Chaos Monkey.

In the IT department, everything is taken care of and the administration is up to date, thanks to tedious labor done by everyone. Deployment is done automagically, and everyone, most of all management, is happy. Extensive monitoring systems make sure everything is seen. OK, it took a 50-meter-long videowall, but that turned out to be a massive marketing tool too.

But then disaster strikes: a ransomware attack. The videowall goes red, showing a massive ransomware notice on every system. Luckily the company is rich enough, and those bitcoins are paid quickly so recovery can begin. Or so they think. They were unknowingly hit by double jeopardy: the botnet vendor sold the attack more than once, and a new ransomware notice appears, from another criminal group. And again the ransom is paid, because recovery must start.

But the problems have only started. The police got wind of the attack. And the Secret Service. And the press. And the customers. And one of their most loyal suppliers suddenly cut all communication. Even after recovery, no-one wants to do business anymore, because word is out that the company can no longer be trusted.

To make matters worse, a lot of equipment is seized by law enforcement for analysis, as it is expected there is more to it. Everything, literally everything, stops.

Even though they seem to have done everything right, something must have been terribly wrong.

Sometimes reality bites. Badly…

Does this sound far-fetched? Ever heard of the SWIFT hack? SWIFT, as you may know, runs the network banks use to process payments between them. When it was attacked in 2016, a lot of the equipment involved was infected so deeply that in the end all hardware, including routers and switches, had to be replaced, because nobody could ascertain how deep the attack went.

Today's attacks are orders of magnitude more advanced and extensive than those were. Double jeopardy, triple jeopardy, infections via USB ports or Thunderbolt: all are common these days. NotPetya halted Maersk's shipping operations and cost the company some 300 million dollars to recover from. The SolarWinds hack hit 18,000 organizations, many of them government agencies that needed to replace everything. The Colonial Pipeline hack deprived a large part of the US of fuel for a week.

The questions are no longer “Can we get out of this mess?” or “How can we prevent such a situation?”. The answers are, respectively, no, and you most probably cannot. Not even the Chaos Monkey helps you out here.

The reality is that the adversaries are far more powerful and intelligent than the parties they attack. Individual black-hat hacker groups have more money than any company (or all but the biggest countries, for that matter), there are multiple groups fighting for dominance, and all play “the enemy of my enemy is my friend”.

Coping with disaster

So, there is this other company that has done everything OK. They know exactly how they built their company. They know there are things like ransomware, viruses and state actors, and they know they will probably not be able to stop them.

They too have the revolving doors, the security guard and all the other, rather boring stuff. But more importantly, they think the other way around. It is not monitoring that keeps things running, but development and testing. They assume that what runs has been tested and conforms to design, so reactive monitoring only serves to find hardware failures and, to a lesser extent, software failures. User error hardly ever happens. They have adopted the habit of re-creating everything every two weeks. They know the most important tool hackers use is time. And they have taken that away.

In addition, they heuristically monitor every action executed, and any outlier is the subject of further investigation. This way, they have only a fraction of the monitoring data to investigate compared to regular monitoring. And if there is an alarm, it most probably is something important. All the time gained, they invest in further automation and machine learning.

At the application level, it is the same story. Apart from being able to install any application automatically within an hour, applications are designed in such a way that they can recover from partial failure. Everything runs as a container in a Kubernetes cluster, which implies that an application recovers automatically in case of failure, scales up in case of high load, and scales down when business is slow. This is not only good financially, but also for the planet. This environmentally minimalistic approach has also turned out to be a massive marketing asset.

In order to create the situation above, everything is tested rigorously. Everything that fails in production leads to additional tests being added, as it was a scenario they had not yet envisioned.

Deployment is automated and can only be executed after successful testing and validation by the appropriate functionaries.

Does this sound utopian? Maybe, but it is the way “cloud” works. Internally, that is.

So, how do you do this?

Just imagine you have to rebuild your company from scratch: what would you do? Would you still go for the first scenario, or would you, however hard it may seem, adopt the latter? Remember: you still have your machines. You still have your products. You know how to produce them and what you need for that. You also have the trained staff to do all this. In other words: all the things that take a lot of time to create, hire or procure are already there!

If you look back at the first company above, the real disaster revolves around one thing: trust. Or rather: the lack of it. Without trust, all efforts turn out to be pretty useless. Customers and suppliers run away, and employees hesitate to make any move, as they are not sure whether their actions will re-ignite the disaster.

The opposite is true for the second company. They trust they can rebuild in case of a failure. They trust that whatever they can create is as safe as they can imagine it to be and they test for that. Because that is what they do anyway. It’s part of their DNA.

So the first thing to restore is trust. In IT, trust often translates into: who can do what, at which time, and can I find a trace of it afterwards? Or in short: Identification, Authorization and Accounting, IAA.

Several models have been devised to store the data for IAA properly, and most of them have in common that they take into account how often the data changes. And exactly this should be the starting point when designing for failure.

As shown above, most of the complex stuff appears to be static data: which roles do you have, and which rights does every role have? If this is translated to the organization, you often see that these roles map onto departments doing things. In other words: this mapping is relatively static and easy to create as well.

The dynamic part is which user should be in which group. This may seem a challenge to re-create, as it would have to come from a backup, but I can assure you that every manager and team leader knows which groups he or she manages and who belongs to which group. They can jot it down on a beer mat in five minutes if needed.
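Once those beer mats are typed in as a simple CSV, re-creating the mapping is just a loop. Here is a sketch against Keycloak's kcadm.sh, assuming a realm "cicd" and an already authenticated CLI; the file format and names are illustrative:

```bash
#!/usr/bin/env bash
# Sketch: re-create the dynamic user-to-group mapping from a CSV file.
# beermat.csv format: username,groupname (one pair per line)
KCADM=/opt/keycloak/bin/kcadm.sh

while IFS=, read -r user group; do
  uid=$($KCADM get users  -r cicd -q "username=${user}" \
    --fields id --format csv --noquotes)
  gid=$($KCADM get groups -r cicd -q "search=${group}" \
    --fields id --format csv --noquotes)
  $KCADM update "users/${uid}/groups/${gid}" -r cicd \
    -s realm=cicd -s "userId=${uid}" -s "groupId=${gid}" -n
done < beermat.csv
```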

This implies that regaining trust may not turn out to be as difficult as it seems. Just take the big platforms you have and ask the vendor which roles should be present. You may even find this in the manual :). Next, figure out how you can link rights to these roles and put that into a matrix.

From this you can also derive some basic design rules your enterprise architects might like:

  • Groups represent the organisation
  • People in groups can execute roles
  • Roles are application specific
  • Rights are action related

With this, you have the basis for trust. Now it comes down to implementing this in an architecture.

The role of a directory

The problem described above has already been solved: the structure above is the basis for directory services such as AD (AGDLP: Accounts go into Global groups, Global groups into Domain Local groups, and Domain Local groups get Permissions) and can easily be implemented in any directory service.

Accessing the data is done on two levels:

  • OS
At the OS level you'll find Microsoft AD, and Kerberos, LDAPS and SSSD for Linux.
  • Application
At the application level, you'll find things like SAML and OIDC.

Both need to be provided for an IT environment to become operational, unless you only have containers without shell access. In that case, no OS access is required and OIDC suffices.

Implementation in the CICD toolbox

In the CICD toolbox, only application level authentication is implemented. Keycloak is used for this purpose. The OS level equivalent, FreeIPA, is in the development branch.

What comes next

In the next blog, we begin building the data structure for the directory, defining some common groups and roles. For the applications in the toolbox, the role-to-rights mapping is described as well.

In the blog after that, the roles required in development are added: git roles for a network development repo and Jenkins roles for kicking off the installation and testing.