Avoid credentials in environment variables

2024-Jan-21

Get them directly from a secrets manager that supports versions

In the Twelve Factor App, section III. Config, Wiggins writes that config should be stored in environment variables, including "Credentials to external services such as Amazon S3 or Twitter".

However, this neglects one important aspect of service-to-service credentials: rotation. Credentials are one of the few configuration values that need to be regularly changed in running systems.

To rotate a credential stored in an environment variable, you need to:

Create the new credential with the other service. Do not overwrite the existing credential [1]
Update your deployment configuration to put the new credential in the environment variables
Restart the entire fleet [2]
Check the service is still healthy
Retire the previous credential

There are two problems here. First, if your service is slow to start, restarting the whole fleet takes a long time. Second, rollback is slow and uncertain.

It's easy to dismiss the first as "incorrect service design", but that ignores the reality that we still need to operate these systems. Some teams may be in "keep the lights on" mode for some systems. Investing more in these systems is not financially justifiable, so they don't get upgraded. That's just life.

Secondly, rollback is a critical control to minimise customer impact. It should be easy, quick, and safe. If you're running a rolling restart you can stop it, but rolling it back means you need to put the old credential back and start again. You might not have the old credential because many services (rightly) refuse to show credentials, instead showing you the credential only once on creation. Worse, if your service boots slowly, your rollback may take a long time, increasing customer impact.

What should you do?

Use a secrets manager that supports versions and put the name or ID of your secret in an environment variable. Have the process read that secret from the credential manager regularly, e.g. once every 2 hours plus/minus 10 mins jitter. This makes the fleet gradually pick up the latest secret without a costly restart. The system should try to get the latest credential more often if it sees authentication failures.
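A minimal sketch of that refresh loop, assuming the secrets manager call is wrapped in a `fetch` function you supply. The class name, the `min_retry_s` floor, and the injectable `clock` are my own illustrative choices, not part of any library. It is not thread-safe, and an exception raised by `fetch` simply propagates to the caller.

```python
import random
import time
from typing import Callable


class RefreshingSecret:
    """Caches a secret and re-reads it on a jittered schedule."""

    def __init__(
        self,
        fetch: Callable[[], str],
        interval_s: float = 2 * 60 * 60,  # "once every 2 hours"
        jitter_s: float = 10 * 60,        # "plus/minus 10 mins jitter"
        min_retry_s: float = 30,          # assumed floor between forced re-reads
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._interval_s = interval_s
        self._jitter_s = jitter_s
        self._min_retry_s = min_retry_s
        self._clock = clock
        self._refresh()

    def _refresh(self) -> None:
        self._value = self._fetch()
        self._fetched_at = self._clock()
        # Jitter spreads the fleet's reads so hosts don't all refresh at once.
        jitter = random.uniform(-self._jitter_s, self._jitter_s)
        self._next_refresh = self._fetched_at + self._interval_s + jitter

    def get(self) -> str:
        """Return the cached secret, re-reading it if the schedule says so."""
        if self._clock() >= self._next_refresh:
            self._refresh()
        return self._value

    def on_auth_failure(self) -> str:
        """Re-read early after an authentication failure, rate-limited."""
        if self._clock() - self._fetched_at >= self._min_retry_s:
            self._refresh()
        return self._value
```

With AWS Secrets Manager and boto3, `fetch` could plausibly be something like `lambda: client.get_secret_value(SecretId=os.environ["DB_SECRET_ID"])["SecretString"]`, where `DB_SECRET_ID` is a hypothetical variable name holding the secret's name or ID.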

If you see an alarm, set the old version as current and the fleet should pick up the correction quickly. You can and should automate this as part of your renewal process.

Use the "last used" date for the old secret to check if it's safe to retire the old version. If it never gets to "more than 4 hours ago", then you know something is still using that secret. If sharing secrets between services is common at your company, you may want to wait a lot longer to see if anything else uses that secret, like an infrequent batch process.

This assumes your service offers the "last used" date on credentials, which might not be true. If it's not true, you'll want to wait long enough that every box in the fleet should have picked it up, then try to archive the secret. If alarms fire, then unarchive the secret.

This is why AWS IAM gives each user two Access Keys and shows the last used date.
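The retirement check can be sketched as a small function. This is an illustration under assumptions: `safe_to_retire` and its parameters are hypothetical names, `last_used` is `None` when the other service does not report a "last used" date, and the 4-hour quiet period is just the example figure used above.

```python
from datetime import datetime, timedelta
from typing import Optional


def safe_to_retire(
    last_used: Optional[datetime],
    superseded_at: datetime,
    now: datetime,
    quiet_period: timedelta = timedelta(hours=4),
) -> bool:
    """Return True when the old credential looks unused for long enough.

    superseded_at is when the new version became current in the
    secrets manager.
    """
    if last_used is None:
        # No "last used" data: wait out the quiet period, then try
        # archiving the secret and watch the alarms.
        return now - superseded_at >= quiet_period
    # Uses from before the switch-over don't count against the old credential.
    return now - max(last_used, superseded_at) >= quiet_period
```

If secrets are shared between services, pass a much longer `quiet_period` so an infrequent batch process has a chance to show up.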

In sum, the rotation process looks like:

Issue a new credential and save it as the latest version in your credential manager.
Wait for the fleet to gradually pick it up.
If you see an alarm within a pre-defined time period, set the previous credential as the latest version in the credential manager.
Revoke the previous credential once its LastUsed datetime is far enough in the past.
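The four steps above can be sketched against an in-memory stand-in for a versioned secret. Everything here is hypothetical scaffolding: `VersionedSecret` is not a real client, and `issue`, `alarm_fired`, `safe_to_revoke`, and `revoke` are callbacks you would wire to your own services. In real use, `alarm_fired` would block for the monitoring window while the fleet picks up the new version.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VersionedSecret:
    """In-memory stand-in for a secrets manager entry with versions."""

    versions: List[str] = field(default_factory=list)
    current_index: int = -1

    def put_latest(self, value: str) -> None:
        self.versions.append(value)
        self.current_index = len(self.versions) - 1

    def current(self) -> str:
        return self.versions[self.current_index]

    def roll_back(self) -> None:
        if self.current_index <= 0:
            raise ValueError("no previous version to roll back to")
        self.current_index -= 1


def rotate(
    secret: VersionedSecret,
    issue: Callable[[], str],
    alarm_fired: Callable[[], bool],
    safe_to_revoke: Callable[[str], bool],
    revoke: Callable[[str], None],
) -> str:
    previous = secret.current()
    secret.put_latest(issue())    # step 1: new credential becomes latest
    if alarm_fired():             # steps 2-3: wait for pickup, watch alarms
        secret.roll_back()        # previous credential is current again
        return "rolled_back"
    if safe_to_revoke(previous):  # step 4: LastUsed far enough in the past
        revoke(previous)
        return "rotated"
    return "rotated_pending_revoke"
```

Rollback only moves a pointer, which is why it stays quick even when the service itself is slow to start.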

Why regularly rotate credentials?

The simple reason is that you don't know if a credential has been leaked. For example, a staff member may've accidentally pasted it into chat. Automatic renewal bounds the window of vulnerability. You can sometimes detect these leaks with tools like TruffleHog, but they're not perfect and you can't check everywhere.

The deeper reason is that you need to test your systems to make sure they're still working. Finding your renewal process is broken during an ongoing security incident is quite unpleasant.

When does this advice not apply?

You can safely ignore this if your system has two properties:

Your credential storage system supports versions
Your service uses short-lived processes, such as one-shot ECS tasks or AWS Lambda functions.

In both cases, you want to inject credentials using valueFrom (or similar) with a secrets manager ARN. On process start, the platform reads the credential from the given ARN, using the latest version, and passes it in the given environment variable. If you rotate the credential and fresh processes start failing, you can mark the old version as current in your secrets manager and new processes will recover.
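For ECS, the injection might look like the fragment below, written as a Python dict in the shape I'd expect to pass under `containerDefinitions` to boto3's `register_task_definition`. The helper name, container name, image, variable name, and ARN are all placeholders.

```python
from typing import Any, Dict


def container_definition(
    name: str, image: str, env_var: str, secret_arn: str
) -> Dict[str, Any]:
    """Build a container definition that injects one secret as an env var."""
    return {
        "name": name,
        "image": image,
        # The platform resolves valueFrom at task start and sets env_var.
        "secrets": [{"name": env_var, "valueFrom": secret_arn}],
    }


definition = container_definition(
    name="worker",
    image="example/worker:latest",
    env_var="DB_PASSWORD",
    secret_arn=(
        "arn:aws:secretsmanager:us-east-1:123456789012:"
        "secret:example-db-password-AbCdEf"
    ),
)
```

Because the ARN carries no version, each new task reads whichever version is marked current when it starts.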

Long-lived connections, like JDBC connection pools, are a problem. You have to reconnect when you get a new credential. That might be a pain because your connection pool might not let you cycle connections. I don't know of any public JDBC driver that allows you to change the connection credentials once set. In this case, you may need to restart the service, but systems with JDBC connection pools are often the least-amenable to restarts. This is less dangerous when you use versioned secrets and a rolling reboot, like only restarting 10% of the fleet at a time while monitoring alarms.
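A rough sketch of that rolling reboot, assuming you can supply a `restart` callback per host and an `alarm_fired` check. Both are hypothetical hooks, and a real version would also wait for each batch to become healthy before checking alarms.

```python
import math
from typing import Callable, List, Sequence


def rolling_restart(
    hosts: Sequence[str],
    restart: Callable[[str], None],
    alarm_fired: Callable[[], bool],
    batch_fraction: float = 0.10,  # "only restarting 10% of the fleet at a time"
) -> List[str]:
    """Restart hosts in batches; stop early if an alarm fires.

    Returns the hosts that were restarted, so the caller knows what
    already picked up the new credential.
    """
    batch_size = max(1, math.ceil(len(hosts) * batch_fraction))
    restarted: List[str] = []
    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            restart(host)
            restarted.append(host)
        if alarm_fired():
            break  # stop the rollout; the rest of the fleet is untouched
    return restarted
```

With versioned secrets, stopping here and marking the previous version as current means only the already-restarted batch needs another restart.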

This is also only for service-to-service credentials. Don't impose this on people who have to remember and type these credentials.

Who supports versioned credentials?

I know that Google Cloud Platform Secret Manager, AWS Secrets Manager, and HashiCorp Vault KV version 2 support versioning. I'm unsure about Azure's offering, though the docs do reference versions.

I haven't used Kubernetes Secrets. It looks like they do not support versions, though I might be missing something in the docs. If true, using this scheme is difficult. You might be able to hack something in using secret IDs that include a version, like SecretId.Current and SecretId.Previous, and have your application check both versions when given only the SecretId in an env var. It may be easier and safer to ignore Kubernetes Secrets and use only the products and services that offer versioning.
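The Current/Previous hack could look roughly like this. It is a sketch under assumptions: `read_secret` and `connect` are callbacks you supply, `AuthenticationFailed` is a hypothetical exception your `connect` raises on bad credentials, and the `.Current` and `.Previous` suffixes follow the naming suggested above.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class AuthenticationFailed(Exception):
    """Raised by connect() when the other service rejects a credential."""


def connect_with_fallback(
    secret_id: str,
    read_secret: Callable[[str], Optional[str]],
    connect: Callable[[str], T],
) -> T:
    """Try SecretId.Current first, then SecretId.Previous."""
    last_error: Optional[Exception] = None
    for suffix in ("Current", "Previous"):
        credential = read_secret(f"{secret_id}.{suffix}")
        if credential is None:
            continue  # this version does not exist
        try:
            return connect(credential)
        except AuthenticationFailed as err:
            last_error = err  # fall through to the next version
    if last_error is not None:
        raise last_error
    raise LookupError(f"no versions found for {secret_id}")
```

The application carries the rollback logic itself here, which is part of why a store with real versioning is the safer choice.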

A note on encryption-at-rest keys

At-rest encryption is a special case that is beyond the scope of this article. Both the Linux Unified Key Setup (LUKS) and FreeBSD's GEOM Based Disk Encryption (GBDE) allow the user to use multiple passphrases. You'll need a similar feature to build a renewal system. Avoid building this yourself; if you must, enlist a professional cryptographer. Many file, block, and object-storage systems (e.g. S3, EBS) are encrypted at rest and offer key rotation.

Conclusion

I am fond of the Twelve Factor App guidelines, but this recommendation does not apply "cleanly" to all systems. Credentials need frequent renewal to check the renewal system works, and we can't change environment variables in a running process. If a fleet-wide restart is painful, then environment variables are not the right tool for giving credentials to the running application. Consider integrating directly with an existing secrets manager, like AWS Secrets Manager, Google Cloud Secret Manager, or HashiCorp Vault KV version 2.

You don't need to worry about this with function-as-a-service or other short-lived processes. They restart very frequently, so you can populate the environment variable using valueFrom: the platform gets the latest credential version from the secrets manager and injects it as an env var.

Finally, avoid any secrets management system that does not offer versions. Versioned secrets are essential for safe key renewal.

Footnotes

[1]	This assumes the service you're integrating with allows more than 1 active credential.

[2]	I'm assuming you gradually restart the fleet while monitoring your alarms. If not, you may have bigger risks to worry about.