
Spotlight: How Secrets Sprawl Undermines Software Supply Chain Security

In this Spotlight edition of the podcast, we’re joined by Mackenzie Jackson, the Developer Advocate at the firm GitGuardian. Mackenzie and I discuss the problem of so-called “secrets sprawl” – the migration of all manner of sensitive information, from credentials to private keys – into public source code repositories on sites like GitHub.

As always, you can check out our full conversation in our latest Security Ledger podcast at Blubrry. You can also listen to it on iTunes and Spotify, or check us out on Google Podcasts, Stitcher, Radio Public and more. Also: if you enjoy this podcast, consider signing up to receive it in your email. Just point your web browser to securityledger.com/subscribe to get notified whenever a new podcast is posted. 

[MP3]


“Given enough eyeballs, all bugs are shallow.” That is “Linus’s Law,” first formulated by Eric Raymond in his 1999 book “The Cathedral and the Bazaar” and named after Linus Torvalds, the creator of Linux. It speaks to a hidden value of open source code: with an unbounded population of developers given access to source code, security and quality issues will quickly bubble up and be discovered, improving security rather than undermining it. 

Mackenzie Jackson is a Developer Advocate at GitGuardian

All Secrets Are Shallow, Too!

Two decades later, open source culture is now firmly entrenched, open source code and libraries are part and parcel of nearly every software development project, and massive, online repositories like GitHub put code at the fingertips of a population of millions of developers and billions of Internet users.

In that new milieu, something like a corollary to Linus’s Law has emerged: given enough eyeballs, all secrets are shallow, too. 

In other words: having thousands of developers crawling over your source code may expose hidden flaws in your application code. (Though there is ample reason to doubt that happens.) But it may also reveal secrets you weren’t aware were buried in your code, or that you hoped nobody would notice. 

Credentials: Gone in 60 Seconds

In fact, secret sprawl, as it is known, is a growing security risk for organizations of all types. Credentials leaked in source code were behind a massive security incident at the ride-hailing firm Uber. And malicious actors are known to be on the hunt for API keys, SSH credentials and other sensitive secrets buried in source code. Experiments by researchers using “honeypot” credentials suggest that the window between a secret being published to a public source code repository on GitHub and those credentials falling into the hands of malicious actors may be measured in minutes, rather than hours, days or weeks. 

The likelihood of that happening is also growing, as developers use common platforms like GitHub to manage both personal and professional development projects, increasing the likelihood of cross-contamination and security lapses. And, once published, powerful commit history features on platforms like GitHub make it hard for development organizations to erase their mistakes.

What companies need are tools to help them identify leaked credentials and other secrets before they get pushed to source code repositories. To talk about the dangers posed by secret sprawl, we invited Mackenzie Jackson into the studio. Mackenzie is a developer advocate at the firm GitGuardian, which makes technology to help detect and block secret sprawl via platforms like GitHub. 

Check out our full conversation above, or click on the button below to download the MP3.


(*) Disclosure: This post was sponsored by GitGuardian. For more information on how Security Ledger works with its sponsors and sponsored content on Security Ledger, check out our About Security Ledger page on sponsorships and sponsor relations.


Episode Transcript

[START OF RECORDING]

PAUL: This Spotlight edition of the Security Ledger podcast is sponsored by GitGuardian. GitGuardian helps secure the software development life cycle by automating secrets detection for application security and data loss prevention purposes. GitGuardian solutions monitor public and private repositories in real time, detecting secrets and alerting staff to allow investigation and quick remediation. GitGuardian helps developers, operations, security and compliance professionals secure software development, define and enforce policies consistently and globally across all of their systems. Check them out at GitGuardian.com.

PAUL: Hello, and welcome to a Spotlight edition of The Security Ledger podcast sponsored by GitGuardian. In this episode of the podcast…

MACKENZIE: If you search on GitHub for “removed AWS key” as a commit message, you will get thousands of results. People will do that.

PAUL: Given enough eyeballs, all bugs are shallow. That is Linus’s Law, formulated by Eric Raymond in his 1999 essay The Cathedral and the Bazaar and named after Linus Torvalds, the creator of the Linux kernel. It speaks to a hidden value of open source code: with enough people looking at it, security and quality issues will quickly bubble up to the top, improving security, not undermining it. But open source culture is now firmly entrenched, and open source code is part and parcel of nearly every software development project. And with massive online repositories of source code at the fingertips of a population of billions of Internet users, a corollary to Linus’s Law has emerged. Given enough eyeballs, all secrets are shallow, too. In other words, having thousands of open source developers crawling over your source code may expose hidden flaws, but it may also reveal secrets you weren’t aware were buried in your code or hoped that nobody would notice. In fact, secret sprawl, as it’s come to be known, is a growing security risk for organizations of all types. Malicious actors are known to be on the hunt for API keys, SSH credentials, and other sensitive information buried in source code repositories. Experiments by researchers using honeypot credentials suggest that the window between a secret being published to a public source code repository like GitHub and those same credentials falling into the hands of malicious actors may be measured in minutes rather than hours, days, or weeks. The likelihood of that happening is also growing as developers use platforms like GitHub to manage both personal and professional development work, increasing the likelihood of cross-contamination and security lapses. To talk about the phenomenon of secrets sprawl, we invited Mackenzie Jackson into the studio. He’s a developer advocate at the firm GitGuardian, which makes technology to help detect and block secret sprawl via platforms like GitHub. In this conversation, Mackenzie and I talk about the problem of secret sprawl within development organizations and the types of sensitive information that companies are inadvertently leaking via their published source code. We also talk about ways that companies can get their arms around the problem of secret sprawl. To start off, I asked Mackenzie to tell us a little bit about GitGuardian and the work that they do.

MACKENZIE: I’m Mackenzie Jackson, I’m the developer advocate or evangelist at GitGuardian.

PAUL: Developer Evangelist… that’s a great title actually. Tell us a little bit about the work you do, and also for our listeners who might not be familiar with GitGuardian, what GitGuardian does.

MACKENZIE: Yeah, for sure. So developer evangelist or developer advocate. Sometimes when I use the evangelist title, I get interesting messages on LinkedIn.

PAUL: Tell me about your deity.

MACKENZIE: Exactly. It’s code security I’m interested in, but yeah. So as a developer evangelist or advocate, it’s essentially my job to teach and educate developers about code security. The great thing is, I do that fairly independently from GitGuardian; we’re not part of the marketing or sales team so much. We’re trying to build awareness of the problems and then also provide solutions for them. A little bit about GitGuardian now: we specialize in detecting secrets inside source code. And when I refer to secrets, I’m really talking about digital authentication credentials. These are typically things like API keys and security certificates, anything that’s meant to be used in a programmatic way and that often ends up within our source code.
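
To make that concrete for readers following along, here is a minimal sketch, not GitGuardian’s code, of the hard-coded credential anti-pattern Mackenzie is describing and the environment-variable alternative. The key value is AWS’s published example placeholder, and the variable and function names are illustrative assumptions:

    import os
    import boto3  # AWS SDK for Python, used only to make the example concrete

    # Anti-pattern: a hard-coded credential. Once committed, it lives on in Git
    # history even if a later commit deletes it. (Placeholder value, not a real key.)
    BAD_AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"

    def client_from_env():
        # Safer pattern: pull credentials from the environment (or a secrets
        # manager) at runtime, so nothing sensitive is committed to source control.
        return boto3.client(
            "s3",
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        )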

PAUL: So I think our listeners are probably quite familiar with GitHub. I mean, we’ve got a fairly technical listenership. GitGuardian might be new to them, talk a little bit about the GitGuardian platform, and kind of what the relationship is to a platform like GitHub.

MACKENZIE: GitGuardian, we try and protect source code in a number of different ways. I think we should start right at what GitGuardian was founded upon. The two founders, the CTO and the CEO, Eric Fourrier and Jérémy Thomas, are both developers, and they decided to do a little bit of an experiment and see what kind of sensitive information they could find on GitHub. They weren’t really expecting to find anything too major. But in that small experiment, they found this huge number of credentials that were leaked in public GitHub repositories. And they realized that a lot of these credentials, even though they may have been leaked on personal accounts, actually belonged to organizations. So they realized that, well, there is a big problem here, and there are a few ways to really come at this. Number one is really what I’m involved in, and that’s giving developers the education and tools to prevent secrets from ending up on GitHub; giving organizations the tools to be able to monitor their GitHub repositories, scan them and be alerted when secrets are in there; and then also giving these larger organizations ways to monitor what’s actually happening in the public ecosystem. There are two and a half million commits made every single day on GitHub, and obviously only a tiny percentage of those, but still a huge number, nearly 5,000 commits a day, contain secrets. So we give organizations the ability to monitor what’s happening with their secrets out in the public ecosystem.

PAUL: I know GitHub is one platform. There’s also, like, GitLab, SourceForge. Is GitGuardian specific to one, to GitHub in particular?

MACKENZIE: No. So we cover integrations into all of the major VCS platforms. When we’re talking about public monitoring, we are specific to GitHub, because just by sheer volume of what happens in the public space, this is the platform where you have your open source projects. Even if you’re using GitLab for your internal repositories, the public activity on GitHub is so much more widespread, by a factor of 1,000 or so. So that’s where we really focus our attention.

PAUL: Fish where the fish are, as they say.

MACKENZIE: Exactly.

PAUL: “Secret sprawl” is sort of the term you use to describe this phenomenon. Could you just describe what secret sprawl is and kind of what types of data or credentials are part of this phenomenon?

MACKENZIE: Yeah. I love the term secret sprawl so much because it kind of creates this image of almost an alien sprawling through there. So when we talk about secret sprawl, ultimately, we’re talking about the unwanted distribution of secrets. Take your API keys, your security certificates, your database credential pairs. These are highly, highly sensitive things, but they’re also widely distributed amongst your team members, and they’re very easy to lose track of. You can imagine if a secret ends up inside a Git repository, even if it’s private, it’s going to be cloned onto multiple different machines. It’s going to be backed up into different locations. It’s going to end up on your internal wiki. Maybe it will be shared between developers in internal messaging systems. And ultimately, what you have is a secret that has been cloned into multiple different places, and you have no visibility, ultimately, over where it ended up. So if there is a breach, if a malicious actor is able to gain access to a private Git repository or to the messaging system, then they potentially can access all of the sensitive information that may also be there. And so we call that secrets sprawl: when you have lost visibility over where your secrets are. Ideally, what you want is your secrets stored in a central location, where you can wrap lots of authentication layers around them and then securely share them between whoever needs them. But the reality is that that doesn’t happen 100% of the time. And one of the things we’re trying to do is not only prevent secret sprawl, but also give visibility over secret sprawl, so you can actually identify where it’s a problem, when it’s a problem, and then take action on it.

PAUL: What are the secrets? Just so our listeners know what we’re talking about. Obviously, credentials for third party applications or platforms that might be embedded in code, but other things as well?

MACKENZIE: Yeah. So if we take the main ones, you have the third-party applications, then you have credentials for your infrastructure, you have your encryption keys for storing data, access to your databases, all of these different things. They can be for what we might call named services, so Stripe, AWS, these are the third-party tools, but they can also be secrets or access keys that you create for your internal communications and the internal modules or components that you make. And when we look at and break down and do postmortems of the attacks that we see today, even if a credential isn’t the initial point of access, often what attackers are really looking for is to gain access to these credentials to elevate their privileges, to move laterally. So they are almost always used in an attack in some way or another. They can be the initial point of access, but they may just be part of the attack path that the malicious actor is going down to try and get deeper into your organization.

PAUL: So just to put some numbers around this phenomenon, I think in the last year, GitGuardian detected something like 2 million instances of leaked credentials or secrets and sent out, I think, around 900,000 alerts actually, to developers about this problem. So judging by that, it’s a pretty widespread problem.

MACKENZIE: Yeah, it’s a huge number. Just over 2 million was what we detected in 2020. We scan every single public commit that’s made to GitHub, so that’s two and a half million a day. It’s a huge amount of information, and each day we find about 5,000 credentials. As part of our kind of pro bono activity, what we do to try and help the community is, when we find these credentials, if it’s possible, we track down the developer through the metadata of the commit or through the account, and we send them an alert to let them know that, hey, this AWS key or this Google key has been leaked in your repository. It’s public. You may want to do something about it. So it’s just a huge number, and it’s actually growing. We’re finding about 20% more year on year at the moment, in line with more and more code coming out there. This isn’t a problem that’s getting better.

PAUL: And, you know, just to kind of connect the dots to things that folks might have read about in the headlines: this type of secret leak is behind a bunch of security incidents and breaches that you may have heard of. Equifax; there was a UN data breach in January. This is under the covers or under the hood of a lot of incidents that may be turning up in the headlines. Is that your understanding?

MACKENZIE: Absolutely. And it’s hard to even keep track of all the breaches that relate to credentials, because, as I said, the ones that you mentioned, and we can talk about a few more, like Uber and Codecov, are ones where credentials were the initial access point for the attacker. So they’ve found a credential in a public space, and this is how they’ve breached the company and got their initial access. And that’s crucial. But when we go deeper, there’s a whole bunch of attacks that have been assisted by credentials even after the attackers made their initial access. In the case of Uber, which has had multiple breaches, there was one such case where attackers were able to gain access to a private Git repository belonging to Uber because of the poor password hygiene of one of their employees. Their password was exposed, the attacker was able to gain access to a private Git repository, and then in that Git repository they found more secrets, which enabled them to move laterally and gain access to sensitive information. So the scale of secrets being used is just absolutely huge. And it’s really what attackers are after, because there are certain alarm bells that go off when you’re trying to break into a company, certain patterns that may come up that let the defense team know that something’s not right. But when you’ve correctly authenticated yourself in these systems, you have the correct authentication and you’re really not going outside the scope of what the security team is expecting, it’s really difficult to even know that you have been breached. It gives attackers the opportunity to squat, remain undetected for long periods of time, gain that information, gain that trust, and then launch an attack from there that is far more widespread than it otherwise could have been.

PAUL: Right. You’re listening to a spotlight edition of The Security Ledger podcast sponsored by GitGuardian.

PAUL: And you point out as well that there are links between this problem and even more endemic problems like password reuse and weak passwords, just poor password hygiene, insofar as developers who might have sloppy password habits may reuse a password between GitHub and some other service. Right? And that may lay open their developer account to a malicious actor who gets a hold of those credentials.

MACKENZIE: Yeah, exactly right. And this is kind of the greater problem of secret sprawl: once you lose track of where your secrets are, once they sprawl into many locations, then an attacker only needs to gain access to those locations to be able to launch that attack. So if you have secrets lying around on machines, and an employee leaves his machine open and goes to the bathroom in a cafe, which a lot of people are doing now that they’re working remote, there may be…

PAUL: Major call!

MACKENZIE: So there’s lots of different ways that attackers can gain access to these secrets. One of the ways that we focus on and talk about a lot is, of course, them being leaked into public spaces. But there’s a bigger problem here that we need to solve, which is making sure that these secrets aren’t sprawled, that they’re centrally located and protected.

PAUL: So you point out that one of the issues here is, kind of in the very nature of the GitHub platform, that developers might manage both proprietary or commercial repositories and personal projects via, of course, one UI, one instance of GitHub, and that there’s this proximity of their personal projects and their professional projects, and that creates the opportunity for mistakes, either proprietary code being entered into their personal repositories, or, I guess, vice versa. But talk just a little bit about that, and what, if anything, is to be done about it.

MACKENZIE: Yeah. It’s really interesting, because when we enter into this topic, we’re also talking about this kind of blurred line between professional and personal that has really been accelerated. We talked a little bit before about remote work due to the pandemic. We’re using the same computers, we’re using the same Git accounts for everything that we’re doing. That delineation just isn’t there. So, one thing: it obviously makes it extremely easy, if we’re using the same authentication, to accidentally push code that’s meant for your professional private repository into a personal public repository, for example. But it also means that there is this blurred line where organizations don’t have control and can’t enforce security over what you do personally; they don’t really have the authority to control what you’re putting out there in the public space in your personal projects. So whilst we can implement policies for organization-related repositories or activities, and we can enforce these to some degree, we’re really blind when it comes to what our employees are doing personally. And with this blurred line between professional and personal life, it’s very easy for these professional keys to end up in personal repositories. In fact, what we found is that of the keys we detected which we know belong to organizations, so corporate API keys, corporate credentials, 85% of them were leaked on employees’ personal public repositories. So most of the leaks that you face as an organization are going to come from your employees accidentally pushing them to public spaces.

PAUL: And how does that happen? Practically? Just if our listeners aren’t familiar with how a platform like GitHub works, how do you fat finger it as a developer on GitHub and end up pushing those credentials out to an insecure personal project that you might also be hosting?

MACKENZIE: Yeah. So, I mean, look, there’s a number of ways that it can happen. You may be working on a personal project, and we can see sometimes that projects are forked from company repositories as a starting point, to get some kind of base, with things buried deep in the Git history. When you clone that project, you’re cloning the history of it, too, and there may be keys leaked out there in the history. You may push that publicly, thinking that it’s all kind of your own work and you just used the original as a guide, but in that history you may expose company information. It might be that you’re working on something personal, but it has some professional connections. One thing that we find, and that we’re particularly interested in, is Slack keys, because we know that attackers use these Slack keys either to gain access to private messages or to launch phishing campaigns, so it looks more official. And we find that a lot of people will take on these personal projects of building a little Slack bot that does something on their work Slack channel. It may not necessarily be related to what they’re doing at work, and so they put it on their personal repository and make it public, and then we’ve got your corporate keys in there. So that’s another way. And then one of the most common areas is just accidentally pushing code into the wrong repository. You’ve got something misconfigured with your terminal and how you’re doing it, and you just push it into the wrong repository, and all of a sudden it’s public. And there’s this really tricky area: when you make a mistake like that, you often can pick it up quite quickly, and your knee-jerk reaction is just to delete, either delete the keys and commit over the top of them, or just delete the repo. But what people don’t understand is that places like GitHub are monitored constantly by bad actors. We monitor it. We can leak a kind of honeypot token, and it can be exploited within a few minutes by an attacker. These repositories also get backed up all over the place; GitHub is backed up in lots of different locations. So even if you delete it, if you commit over it, it’s still going to be there in the history. If you delete the entire repository, the chances are someone’s already found it.
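
To illustrate why deleting or committing over a key doesn’t help, here is a rough sketch of the kind of full-history scan an attacker, or a defender, can run against a cloned repository. It is not GitGuardian’s scanner; it checks only one well-known pattern (AWS access key IDs), and the function name is an assumption made for illustration:

    import re
    import subprocess

    # One well-known detector: AWS access key IDs start with "AKIA" followed by
    # 16 characters. Real secret scanners use hundreds of such patterns.
    AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")

    def scan_full_history(repo_path="."):
        # "git log -p --all" replays every change ever committed on every branch,
        # so a key that was later deleted or overwritten still shows up here.
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "-p", "--all"],
            capture_output=True, text=True, check=True,
        ).stdout
        return sorted(set(AWS_KEY_RE.findall(log)))

    if __name__ == "__main__":
        for key in scan_full_history():
            print("possible leaked AWS access key id:", key)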

PAUL: And then this intersects with kind of longstanding, loosey-goosey developer behavior. Like, for early versions, non-production code, you might hard-code in credentials just to save yourself a lot of work or to make things run smoothly in test. And then, of course, when you get to production or shipping, you remove those credentials. Although, frankly, there are plenty of examples of companies not removing those credentials. But as you point out, it doesn’t matter so much on GitHub, because all that code history is part of the record. And therefore, if you did that at any point in development, arguably, those credentials are exposed.

MACKENZIE: That’s exactly right. And people kind of rely a lot on things like code reviews. But when you’re talking about a code review: if I’m quickly trying to get something to work, I’m working on my own independent branch, so I hard-code some credentials. I get it all running, and then later on, before I merge it into my master branch, I clean it up because I know people are going to look at it. I delete the hard-coded credentials, and then when the reviewer comes to see it, they’re looking at the latest version. They’re comparing it to the master branch, and it all looks great to them. They pull that in and, unbeknownst to them, there are credentials in that history. Now, it’s very hard for you to be able to find them as the reviewer, or even as the organization, but an attacker that is specifically looking for those is going to be able to find them in a few seconds. We’ve talked a lot about public repositories, but it’s just as important to keep private repositories free from these credentials, because, as we’ve talked about, attackers can gain access to these, and in a lot of cases they’re really targeting them. They’re just a known treasure chest of sensitive data, because their histories are so rich, and it’s very hard without automated tools to get visibility into your history.

PAUL: You know, sometimes developers make the job super easy by putting comments with a commit like “removed secrets from repo.”

MACKENZIE: Yeah, exactly. If you search on GitHub for “removed AWS key” as the commit message, you will get thousands of results. People will do that. And whether or not the key gets deleted, we check if it’s still valid. So we’ll do a validity check: for instance, we’ll check that this AWS key not only looks like an AWS key, but that it actually gives me access to a system, and we’ll check periodically over time. Is that secret still there? Has it been removed? Is it still valid? And we can see that a lot of these secrets, maybe they’re committed over, maybe they’re removed completely, but they remain valid for a long period of time. It’s a huge problem, and a little bit of a misunderstanding about what we’re facing with technology like Git.
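
The validity check Mackenzie mentions can be approximated for AWS keys with a single harmless API call. The sketch below is an illustration under that assumption, not GitGuardian’s implementation; it uses the boto3 SDK’s sts.get_caller_identity call, which requires no special permissions, and should only be run against keys you own or are authorized to test:

    import boto3
    from botocore.exceptions import ClientError

    def aws_key_still_valid(access_key_id: str, secret_access_key: str) -> bool:
        # sts:GetCallerIdentity succeeds for any active key pair, so it is a
        # common low-impact way to tell whether a leaked key has been revoked.
        sts = boto3.client(
            "sts",
            aws_access_key_id=access_key_id,
            aws_secret_access_key=secret_access_key,
        )
        try:
            sts.get_caller_identity()
            return True
        except ClientError:
            # Typically "InvalidClientTokenId" once the key has been deactivated.
            return False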

PAUL: Right. So I guess one question would be, what can organizations do? A development organization of any size almost certainly has a problem around secret sprawl. What can they do? I don’t get the sense that the die is cast and, once the secrets are out there, you’re just screwed. What can organizations do to clean up this mess? First of all, figure out what the dimensions of it are within their organization, and then take steps to really clean it up and remove the exposure, the risk.

MACKENZIE: Yeah. That’s a great question. I think there are really four areas. Number one, the most important, is to implement secrets detection on your systems to make sure that your repositories are clean, including their history, and that you’re running from a clean basis. So that’s definitely the first point: make sure that there’s nothing lying around and that you have a way to actually handle and store these secrets securely. The second area comes on the developer’s side, really. It’s not enough to just tell people, “OK, adding secrets into source code is bad, don’t do it.” It’s almost never malicious when this happens; we’re humans, we handle sensitive information, and we’re going to make mistakes. So we have a tool called GG Shield. It’s an open source project specifically designed for developers, and it allows them to install, say, a pre-commit check, a pre-commit hook or post-commit hook, where if they commit something with sensitive information in their code, it’s going to be picked up before it gets into the Git repository. And that is really important, because once it enters your Git repository, even if it’s private, that key needs to be rotated; it needs to be considered compromised. There may be implications of that, there may be active systems tied to that key, so it’s not just a simple matter of deleting it and issuing a new one. But if we can pick them up before that point with a pre-commit hook, well, then you’re not red-faced going to the team to let them know you made a mistake, and we can remediate it without any long-lasting impacts. Then we can also move on to other areas, like monitoring our wider perimeter: if employees leak keys on their personal repositories, we may not be able to enforce security policies there, but we can still monitor the activity. And then also implementing checks on our finished applications; we find a lot of secrets in areas like Docker images, so running automated scans on your Docker images once they’ve been built. So all of these help, but I think number one is: clean up your house first, make sure you haven’t got any dirty laundry lying around in your repositories, and then give your employees the tools that they need to take action themselves and take some ownership over the security problem.
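
GG Shield ships a full library of detectors for this; purely to show the mechanics of a pre-commit check, here is a toy hook that scans the staged diff for two obvious patterns and blocks the commit if it finds one. The patterns and the file location (saved as .git/hooks/pre-commit and made executable) are illustrative assumptions, not a substitute for a real scanner:

    #!/usr/bin/env python3
    # Toy pre-commit hook: block the commit if staged changes look like secrets.
    import re
    import subprocess
    import sys

    PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),                              # AWS access key id
        re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),  # private key blocks
    ]

    def main() -> int:
        # Look only at the lines being added in this commit, across all staged files.
        diff = subprocess.run(
            ["git", "diff", "--cached", "--unified=0"],
            capture_output=True, text=True, check=True,
        ).stdout
        added = [l for l in diff.splitlines()
                 if l.startswith("+") and not l.startswith("+++")]
        hits = [l for l in added for p in PATTERNS if p.search(l)]
        if hits:
            print("Possible secret in staged changes; commit blocked:")
            for line in hits:
                print("  " + line.strip())
            return 1  # non-zero exit aborts the commit
        return 0

    if __name__ == "__main__":
        sys.exit(main())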

PAUL: So in your data, you’ve got some really interesting information or insights into where this is particularly a problem, what countries, for example. Are there any real clear patterns as to whether this is a problem that is concentrated in certain types of organizations, or even certain populations of developers? Or is it really kind of an endemic problem that’s just everywhere, and maybe it varies based just on how much software development is going on at that particular place?

MACKENZIE: Yeah, we’ve tried to find patterns in different areas, and we can find some. Like, India is the number one country for leaked credentials, but they also have a huge engineering population.

PAUL: Indeed.

MACKENZIE: And we’ve also tried to find patterns around the type of developers that leak credentials. And, of course, your junior developers are going to leak a little bit more. But we also find that when we’re talking about the serious incidents, it’s actually really across the board. What we’re talking about is a mistake, human error, that can happen in a number of different ways, and we can’t find any clear data that really shows this is the profile that is most at risk. It’s actually just a problem that can affect everyone. We see this with senior engineers as much as junior ones when we’re talking about the real corporate keys that are being leaked. It’s just a very widespread problem, and there don’t seem to be any really clear patterns over who the persona is that is most at risk for a company.

PAUL: Interesting stuff. Mackenzie, is there anything that I didn’t ask you that I should have?

MACKENZIE: Yeah. No. I think we covered just about everything in here. I think we’ve got a good range, and I don’t want to overwhelm everyone too much with information and doom and gloom.

PAUL: Well, I mean, you’ve got a solution, right? So they shouldn’t feel too gloomy because there are things you can do. It’s not just we’re all screwed.

MACKENZIE: And there definitely are things that can be done. And this goes together with a holistic approach to security. What I really love about the solution is that we’re not just monitoring the security events and kind of handing it off to the security team. The developers are an active part of the conversation, and they can take ownership of it themselves with the solution, too.

PAUL: Mackenzie Jackson of GitGuardian, this has been a great conversation. Thank you so much for coming on and speaking to us on the Security Ledger podcast.

MACKENZIE: Thanks for having me. It’s been great.

PAUL: Mackenzie Jackson is a developer advocate at the firm GitGuardian. He was here to talk with us about the problem of software secrets sprawl. You’ve been listening to a Spotlight edition of the Security Ledger podcast, sponsored by GitGuardian. GitGuardian helps secure the software development lifecycle by automating secrets detection for application security and data loss prevention purposes. GitGuardian solutions monitor public and private repositories in real time, detecting secrets and alerting staff to allow investigation and quick remediation. GitGuardian helps developers, operations, security and compliance professionals secure software development, define and enforce policies consistently and globally across all of their systems. Check them out at GitGuardian.com.

[END OF RECORDING]