Amazon AWS - S3's exposed public buckets
Ok, so I know what you’re thinking: this has been talked about over and over again. But this story is far from dead. I’m going to look at the current state of public S3 buckets and step into the mindset of people hunting for public S3 data. Despite the news stories and Amazon’s own direct warnings to customers, there are still large quantities of exposed data, and similarly large numbers of white- and black-hat hackers looking for it.
On the 12th of June 2017, security researcher Chris Vickery (@VickerySec) discovered a public S3 bucket containing, amongst other things, the personal details of 198 million American voters. The data treasure trove was vast, with 1.1TB of insecure data and a further 24TB of secure data. This leak has been well documented; Gizmodo has a nice piece on it if you’re after more detail.
Following this and other high-profile data leaks, Amazon contacted owners of public buckets directly, making sure they understood the implications of allowing public access to their buckets.
How S3 works
S3 is just a storage platform: you let Amazon worry about looking after your data, making it easily accessible to people and applications all over the world, without maintaining your own infrastructure. I have few criticisms of S3; it does what it says on the tin, and does it exceptionally well.
S3 data is broken down into buckets; you can literally think of them as buckets of data. Each bucket gets a URL, for example, bucketname.s3.amazonaws.com. You can try it yourself: http://bbc.s3.amazonaws.com/. Who knows who owns that bucket? It could be the BBC, but more likely it’s someone else who happened to name their bucket BBC. Either way, as you can see from the response, it’s private.
We get a 403 and a pretty self-explanatory error message.
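For reference, the handful of status codes a bucket probe can return tell you almost everything. A minimal sketch of what each one means (the statuses are standard S3 behaviour; the function itself is my own illustration):

```javascript
// A GET on http://<name>.s3.amazonaws.com returns one of a few telling codes.
function classifyBucket(statusCode) {
  switch (statusCode) {
    case 200: return 'public';   // listable: S3 returns an XML index of keys
    case 403: return 'private';  // exists, but AccessDenied - like the bbc bucket above
    case 404: return 'missing';  // NoSuchBucket - the name is unclaimed
    default:  return 'unknown';  // e.g. 301 PermanentRedirect for wrong-region requests
  }
}

console.log(classifyBucket(403)); // -> "private", the bbc bucket's response
```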
In simple terms, every file stored in S3 can be made either public or private, and both options function as you would expect. Public buckets are a core feature of S3 and in most cases exist perfectly legitimately. Many images, videos and even entire static websites are hosted in this way.
Let’s look for some public buckets to poke around in
I was trying to think of a real-world analogy for what we’re trying to do here. Invariably, these analogies don’t do a good job of converting the digital into the real world, but here we go anyway. Exploring public buckets, in my opinion, is not quite the same as trying the doors of people’s homes to see whether they’re locked. It’s more akin to someone accidentally leaving their personal possessions on display in a shop.
This shouldn’t be too tricky. We could get lucky guessing bucket names, but let’s speed this up a bit. A few quick lines of NodeJS plus a few NPM modules, and I can try 1000s of buckets a second. All I need now is to plug in some wordlists:
But if we’re really looking for gold… Alexa publishes the top 1m sites as a useful list. I stripped off the TLDs and plugged them into AWS.
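The Alexa list ships as CSV lines of rank-comma-domain, so turning it into bucket-name guesses is a one-liner per row. A minimal sketch (the naive first-dot split is an assumption about how the stripping was done):

```javascript
// Turn an Alexa top-1m CSV line like "1,google.com" into a bucket name guess.
function domainToBucketName(csvLine) {
  const domain = csvLine.split(',')[1]; // "1,google.com" -> "google.com"
  return domain.split('.')[0];          // "google.com"   -> "google"
}

// Naive for multi-part TLDs, but "bbc.co.uk" still yields "bbc",
// which is exactly the guess we want anyway.
console.log(domainToBucketName('1,google.com'));  // google
console.log(domainToBucketName('2,bbc.co.uk'));   // bbc
```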
An alternative to using wordlists, if you had the resources and access (which I don’t), would be to capture a large volume of DNS traffic, TLS hello traffic or response headers from a network appliance or a VPN / Tor exit node. So many sites and apps utilise S3 that you’d be pretty likely to build a comprehensive list of well-used buckets.
I’ve decided against publishing the source code, but there’s no black magic in there, pretty simple stuff.
Sure enough, 1000s of public buckets come scrolling by, holding anything from a few files up to hundreds of thousands or more. My initial thought was to put all the results into a Mongo database, creating some kind of public-bucket search engine for off-line analysis. However, I quickly realised the quantity of data was so vast that my NodeJS script and maxed-out iMac weren’t up to the job. So instead I decided to filter the discoveries in real time and have the script crudely notify me in the console of any interesting finds.
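The real-time filter needn’t be clever — a handful of regexes over object keys catches most of the obviously sensitive material. The patterns below are illustrative guesses at what “interesting” might mean, not the author’s actual rules:

```javascript
// Crude filename patterns that suggest sensitive content.
const INTERESTING = [
  /\.sql(\.gz)?$/i,    // database dumps
  /backup/i,           // backup archives
  /\.(pem|key|ppk)$/i, // private keys
  /passw(or)?d/i,      // password files
  /\.(csv|xlsx?)$/i,   // spreadsheets / data exports
];

// Flag an S3 object key if any pattern matches.
function isInteresting(key) {
  return INTERESTING.some(re => re.test(key));
}

console.log(isInteresting('customers-backup.sql')); // true
console.log(isInteresting('logo.png'));             // false
```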
What’s in the results?
Public buckets fall into three categories:
1. Static websites
This represents the majority of discoveries. A collection of HTML, JS, CSS, etc, nothing that isn’t already downloaded by your browser every time you visit the website. Any data that’s exposed in these buckets is also going to be exposed on the owner’s website.
This site is actually hosted in a public S3 bucket (jamesbeck.co.uk.s3.amazonaws.com if you’re interested).
2. Static content storage
This is another classic use case. Take, for example, a video or image hosting service not dissimilar to YouTube or Flickr. You allow users to upload content to the site, which they can then share with other web users. Some of this content will be indexed in the site’s internal search facility; other content will be private, allowing your users to choose who they share it with.
Typically this is achieved by generating a hash or key for each image, making it publicly available, but only divulging the key to the owner of the content. For example, www.amazingimagehosting.com/images/167C1DE9E9FECC51FA7E0B74C64831E00B7C4CC6.
Nobody is ever going to guess the file’s name; they have to access it via your website. Problems arise when you allow your S3 bucket to make a public list of its contents available. It’s a simple tick box in S3: you can keep all your files publicly accessible, but only to those who know the address, rather than publishing a list of them.
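In policy terms, the safe configuration grants public `s3:GetObject` on the objects but never `s3:ListBucket` on the bucket itself. An illustrative bucket policy (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObjectOnly",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}
```

Anyone with a full object URL can fetch the file, but a GET on the bucket root returns AccessDenied rather than an index of every key.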
I found many, many buckets that fall into this category, including:
- The entire archive of an iOS and Android app that offers peer-to-peer video and image messaging (hashed file names for user privacy, neatly indexed for me)
- The entire content archive of an adult video and image host (again hashed file names thwarted by the bucket index)
- What appears to be the entire output of a TV station in neat half hour segments (again indexed for convenience)
I’ve responsibly disclosed my findings to the owners of some of these buckets where I’ve been able to identify them. In most cases, unless they’ve got a reason to allow public listing of their files, it’s a very quick job to log in to S3 and remove the listing permissions from their bucket whilst keeping the files publicly accessible. I’ll keep you posted on any interesting replies I get.
In an ideal world, this content would be completely private and delivered to your users via a web server that can control access. That way the server can request the content from S3 (after authenticating) and then stream it on to the user. However, this approach removes many of the benefits of utilising a cloud storage platform and/or CDN.
3. Stuff that simply shouldn’t be shared!
The final category of exposed bucket simply has no excuses. We’re talking an abundance of company financial records, database and file backups, entire computers including ‘My Documents’ folders, the list goes on and on.
The majority have likely been set up by amateur developers to store files being saved by their applications; others are clearly for personal backup or file-sharing usage.
There is a lot of content residing in public buckets that shouldn’t be. Along with the obvious data leakage, you could feasibly rack up someone’s AWS bill by making millions of requests to their bucket.
In reality, many business owners won’t even know they’re using AWS behind the scenes and their development company won’t care about the project they wrapped up and got paid for last year.
Could Amazon do more?
I think AWS is fantastic; my limited experience of it has been flawless. I’m about to switch my personal backup to S3, even though I’m one of the lucky people who started using Google Drive when unlimited storage was actually a thing. Currently, S3 strongly defaults to making buckets, and the objects within them, private. If you make a file public and then re-upload a new version via the S3 console, it is automatically reset to private.
This is pure user and developer oversight, nothing more complex than that, there really are no excuses.
That said, there are a few obvious things Amazon could look at. I was able to make 100s of 1000s of requests to AWS in a short period of time, for different buckets, with identical request headers (which would actually have disclosed that I was using NodeJS’s request module), all from the same IP address. It wouldn’t take a genius IDS to profile these requests as potentially malicious and block them, or begin to significantly rate-limit them; it should have been obvious I was searching for public buckets. Of course, in reality, I would have just changed my VPN exit node and cracked on, but it still represents a logical step towards deterring and slowing down those hunting for public buckets.
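The sort of profiling I mean is not sophisticated. A sliding-window counter per source IP would catch my scanner immediately — the window and threshold below are arbitrary illustrative numbers, not anything AWS actually uses:

```javascript
// Flag an IP that performs too many distinct bucket lookups per minute.
const WINDOW_MS = 60 * 1000; // sliding window length
const MAX_LOOKUPS = 100;     // lookups allowed per window

const seen = new Map(); // ip -> timestamps of recent lookups

function suspicious(ip, now = Date.now()) {
  // Keep only timestamps still inside the window, then record this lookup.
  const times = (seen.get(ip) || []).filter(t => now - t < WINDOW_MS);
  times.push(now);
  seen.set(ip, times);
  return times.length > MAX_LOOKUPS;
}
```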
I also don’t really understand why Amazon lets you name your buckets in the first place; surely a 128-bit key would be far better. There’s no value in having human-friendly bucket names: most buckets are accessed programmatically by software, or via a CNAME record set up on an alternative domain name. This really would mitigate a large part of the problem.
Who’s looking for these buckets?
Well, that really is the question. It’s highly likely that there are entities out there looking to illegally leverage the exposed content of these buckets for their own gain, especially after some well-publicised, high-profile finds.
To test the theory, I’ve set up a public bucket, put some juicily named files in there and enabled logging. Obviously, publishing the name here would completely ruin my experiment, but you’re welcome to go looking for it! I chose a non-relevant dictionary term, on the logic that people hunting for public buckets would follow the same train of thought I did.
I did apply a little catalyst by posting the URL and paths to various files on Pastebin. I’ll report back in a follow-up post in a week or so.