
Recon Series: Automation (Part-3)

This is the last part of the Recon Series which focuses on automating the recon process
Tags: reconnaissance, recon, automation, bug bounty, AWS, cloud
Bhavarth Karmarkar
November 8th 2023.

Recon Series: Big Automation (Part-3)

In the previous two editions of this series on recon, we began by understanding the significance of recon in identifying hidden vulnerabilities. Part 1 introduced subdomain enumeration, passive and active recon, and valuable tools. Part 2 took us deeper into directory bruteforcing, public-archive URL fetching, parameter discovery, and advanced dorking techniques.

Now, in Part 3, we are going to automate the majority of these open-source tools and processes to make our lives easier. This walkthrough is not just about automation, but about automation at scale. Almost every bug bounty hunter automates some part of their process. The tools mentioned in the previous two parts, along with all of their different options, are not meant to be typed by hand. Just imagine a scenario where you end up with 10K subdomains after subdomain enumeration; anyone wise enough knows it is not feasible to type the full command ffuf -u https://sub1.target.com -w seclists/Discovery/Web-Content/directory-list-2.3-medium.txt by hand for every subdomain that was found!

Most bug bounty hunters will at least do something like:

while read -r sub; do ffuf -u "https://$sub" -w seclists/Discovery/Web-Content/directory-list-2.3-medium.txt | tee -a "$sub.directories"; done < subs.txt

While this can be helpful for smaller, mundane penetration testing tasks, it is nowhere near efficient enough if you want to wake up every day to a cup of coffee and some actionable intel to begin your bug hunting for the day, and maybe even land some low-hanging bugs, who knows!
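
Even this simple loop can be sped up considerably with plain shell parallelism. Below is a minimal sketch using xargs; the parallelism level and the per-subdomain output file names are arbitrary choices for illustration:

    # Run ffuf against every subdomain in subs.txt, five scans in parallel.
    # -P sets the number of parallel processes; -I{} substitutes each subdomain.
    xargs -P 5 -I{} ffuf -u "https://{}" \
      -w seclists/Discovery/Web-Content/directory-list-2.3-medium.txt \
      -o "{}.json" < subs.txt

Even this naive fan-out beats sequential scanning, but it still runs on a single machine, which brings us to the question of scale.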

Why such a fuss about Scale?

Since the scale of bug bounty scopes and targets is really large these days, with thousands of subdomains and new functionality being introduced continuously, we have to up our recon game to match that scale. It is therefore important to design an infrastructure that regularly scans all the targets in a database, so that we are among the first to detect a new subdomain or a new endpoint on a target. Moreover, we must also set up a notification mechanism that automatically alerts us when something interesting pops up on one of our targets.

This blog will largely skip implementation code, pausing only for small illustrative sketches, and instead focus on how to design an architecture for our automation that is highly scalable.

Architecture and Design


Before even beginning to write the code, we must come up with a reliable, scalable, upgradable and maintainable architecture for our recon process. This is a crucial step if we don't want to start over from the ground up after realising one of our assumptions was wrong.
First, let's get a basic understanding of what I mean by reliable, scalable, upgradable and maintainable out of the way.

Reliability


Since open-source tools form the backbone of our entire process, we have to ensure that a modification to any of these tools does not cause an avalanche effect. We need to assume that any of these tools can fail at any time; in that sense, reliability directly translates to error handling.

[Figures: recon process flow diagrams]
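
To make this concrete, here is a minimal error-handling wrapper in bash. The 10-minute timeout, the retry count of three, and the failed_jobs.log file are all assumptions for illustration, not a prescribed design:

    # Run any command with a 10-minute timeout, retrying up to 3 times.
    # Jobs that still fail are logged so the rest of the pipeline keeps moving.
    run_job() {
      local tries=0
      until timeout 600 "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge 3 ]; then
          echo "FAILED: $*" >> failed_jobs.log
          return 1
        fi
        sleep 5   # small backoff before retrying
      done
    }

    run_job subfinder -d target.com -o target.com.subs

The key property is that a single flaky tool invocation degrades into a log entry instead of an avalanche that takes down the whole pipeline.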

Scalability

As there is a really wide variety of targets out there, we must ensure that our process scales to fit the target scope. This means we should use only as many resources as the target demands. The first thing that comes to mind after hearing this is cloud-based solutions, and indeed we have to move to the cloud if we want to be scalable, as cloud providers can scale up and down automatically based on load. There are basically two design choices for scaling:

  1. Vertical scaling: when load increases, switch to a more powerful VPS with more RAM, a faster CPU, and so on.
  2. Horizontal scaling: when load increases, more VPS instances with similar computing capability are spun up.

    [Figure: vertical vs. horizontal scaling]

    Personally, I prefer horizontal scaling, as it provides better concurrency than a single VPS. Recon is not particularly resource-intensive; what we really need is concurrency.
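
If the workers are packaged as containers, horizontal scaling can be as simple as the one-liner below; the compose service name worker is a hypothetical example standing in for the worker client described later:

    # Spin up ten identical worker containers from a compose service named
    # "worker"; scale back down the same way when the load drops.
    docker compose up -d --scale worker=10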

Upgradability


New CVEs and exploitation techniques appear nearly every week thanks to the ever-growing research in cyber security by fantastic researchers. To keep our process up to date with the latest exploits, we want the ability to easily add new commands and PoCs to our code. Nuclei templates are the best example of how this can be achieved.
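
Nuclei makes this tangible: pulling the newest community templates and re-scanning is a two-liner. The file names below are assumptions for illustration:

    # Fetch the latest community templates, then re-scan all live hosts
    # with only critical- and high-severity checks.
    nuclei -update-templates
    nuclei -l live_hosts.txt -t cves/ -severity critical,high -o nuclei_findings.txt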

Maintainability

We must ensure that our infrastructure is divided into small modules, each performing one specific action: for example, one module to run commands defined in configuration files, a separate module to spawn workers that execute jobs, another module to queue jobs, and so on. This greatly reduces the time needed to debug and fix errors, since you just need to find the relevant module and make changes in a single place.

Now that you have a clear understanding of the key aspects our project must include, we can start thinking about the architecture. We are going to divide the whole process into modules to accomplish the task. A high-level overview of our architecture is shown in the image below.

[Figure: high-level process architecture]

Some things to pay attention to in this architecture:

  1. There is a primary VPS server which monitors the entire recon process and is used to start, pause or stop the main application. For carrying out the actual reconnaissance, clusters of small VPS instances or docker containers, called workers, are spawned; they pull jobs from a global queue. [The resulting race condition needs to be handled separately.]
  2. We have divided the process into different modules; a miniature sketch of them follows this list.
    • There is a module for interacting with the databases, which allows us to store the recon data in structured relational tables in a postgres or sqlite database; an example relation can be seen in the image below.
      [Figure: example database skeleton]
    • Another module is responsible for queuing jobs. A job in this context is an abstraction over a command or tool that needs to be run; we might keep a configuration file for each job, which gets converted into a bash command and pushed onto a queue, using either Redis or AWS SQS.
    • A separate module runs on the workers and behaves like a client application: it pulls jobs from the global job queue, extracts the information each job carries, performs the steps needed to complete it, and reports the results back to the primary VPS.
    • There is also a dedicated module for monitoring the workers and collecting their results, making sure the workers are running properly.
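
To make these modules concrete, here is a deliberately tiny sketch of the database, queue and worker pieces using sqlite3 and redis-cli. The table schema, the queue name recon_jobs and the job format (a raw command string) are all illustrative assumptions:

    # Database module: persist findings in a structured relational table.
    sqlite3 recon.db "CREATE TABLE IF NOT EXISTS subdomains (
        id INTEGER PRIMARY KEY, target TEXT, subdomain TEXT UNIQUE, found_at TEXT);"

    # Queue module (runs on the primary server): a job is a command string.
    while read -r sub; do
      redis-cli LPUSH recon_jobs \
        "ffuf -u https://$sub -w seclists/Discovery/Web-Content/directory-list-2.3-medium.txt -o $sub.json"
    done < subs.txt

    # Worker module (runs on each worker): BRPOP is atomic, so no two workers
    # can ever receive the same job, which sidesteps the race condition above.
    while true; do
      job=$(redis-cli BRPOP recon_jobs 0 | tail -n 1)
      sh -c "$job"   # in practice, wrap this in the retry logic from earlier
    done

Reporting results back to the primary could be another Redis list or a direct write to the shared database; either fits the same pattern.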

Some further improvements to the architecture are worth considering:

  1. A system for notifying the user about any potential signal worth inspecting manually (hint: Project Discovery has already built this; see the sketch after this list).
  2. A web-based dashboard enabling us to monitor and analyse the data via informative graphs, charts, etc.
  3. Leveraging the power of AI: using GPTs to further improve your threat signals.
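
For the notification piece, ProjectDiscovery's notify pairs nicely with tomnomnom's anew. A sketch, assuming notify has already been configured with a chat provider (Slack, Discord, etc.) and that subs_all.txt holds every subdomain seen so far:

    # Append only never-before-seen subdomains to subs_all.txt, and pipe
    # exactly those new lines to a chat webhook via notify.
    anew subs_all.txt < subs_today.txt | notify -silent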

[NOTE] The precise technical implementation of the process has been abstracted away in this blog; the technologies mentioned throughout are only examples of how a specific purpose can be achieved, ignoring their limitations.

Conclusion

Before ending this blog, I just want to say: before building a recon process, always research how much of the work you can outsource. For example, AWS Fargate can handle all of the scaling for us, which is far better than drowning in frustration while implementing your own logic for automatically spawning and stopping containers. That said, make sure you completely understand every moving piece of your automation, as that understanding can save you both time and money.

Always remember: automation is not equivalent to automatically finding great bugs. It should be seen as a tool that helps you focus on what's actually interesting rather than wasting time on false signals.

