# Squall

Network distributed target acquisition using RabbitMQ.

## WIP

Docs are a WIP and extremely sparse. May even be incorrect in places. You're delving into unknown depths with an untrustworthy map.

## Workflow

Producer is a standalone process that fills RabbitMQ. Consumer reads from this queue and processes the work, periodically sending progress reports to Observer. When Consumer has finished its job, it sends the result to Collector and then pulls another job from the queue.

All of this is done with message passing through RabbitMQ. "Producer" pushes to "Jobs". "Consumer" reads from "Jobs", sending progress to "Progress" and job results to "Results".

Collector reads from "Results" and updates the database accordingly.

Observer reads from "Progress" and updates the database accordingly.

Producer reads from the database and feeds the "Jobs" queue accordingly.
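The whole flow can be sketched with an in-process simulation — plain Python queues standing in for the RabbitMQ feeds, and all names illustrative rather than taken from the actual codebase:

```python
import queue

# In-process stand-ins for the RabbitMQ feeds (illustrative only).
jobs = queue.Queue()
progress = queue.Queue()
results = queue.Queue()

def producer(job_ids):
    # Producer reads pending jobs from the database and fills "Jobs".
    for job_id in job_ids:
        jobs.put(job_id)

def consumer():
    # Consumer drains "Jobs", reporting to "Progress" and "Results".
    while not jobs.empty():
        job_id = jobs.get()
        progress.put({"job": job_id, "pct": 100})  # read by Observer
        results.put({"job": job_id, "ok": True})   # read by Collector

producer([1, 2, 3])
consumer()

# Collector and Observer would read these feeds and update the database.
done = [results.get() for _ in range(results.qsize())]
print([r["job"] for r in done])  # -> [1, 2, 3]
```

In the real system each of these roles is a separate process and the queues are durable RMQ feeds, but the topology is the same.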

Outside of necessary DB connections, every node is stateless. Every job is idempotent. Progress and JobDone packets are timestamped to the millisecond for versioning to avoid collisions. The system makes no attempt at error correction outside of Consumer retrying downloads. Failed jobs are simply marked as failed and aborted. It is up to a sysadmin to reset failed jobs once a batch is complete.
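The millisecond-timestamp versioning amounts to a conditional update: a packet is applied only if it is newer than what has already been recorded for that job. A minimal sketch (hypothetical names; the real state lives in a Postgres row, not a dict):

```python
import time

# Last-applied packet timestamp per job (stands in for a Postgres column).
job_versions = {}

def apply_packet(job_id, ts_ms, update):
    """Apply a Progress/JobDone packet only if it is newer than the last
    one recorded for this job; stale or duplicate packets are dropped."""
    if ts_ms <= job_versions.get(job_id, -1):
        return False  # stale packet, ignore
    job_versions[job_id] = ts_ms
    update()
    return True

applied = []
now = int(time.time() * 1000)
apply_packet(42, now, lambda: applied.append("first"))
apply_packet(42, now - 5, lambda: applied.append("stale"))  # dropped
apply_packet(42, now + 5, lambda: applied.append("newer"))
print(applied)  # -> ['first', 'newer']
```

Because jobs are idempotent, dropping a stale or duplicate packet is always safe — re-running the job produces the same result.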

All of this is an attempt to not overcomplicate the system. At the end of the day, we're pushing terabytes of data over network pipes, so failures are guaranteed. We want those failures to be external, not internal. Internal failures should be easy to track down, and one way to facilitate that is a smaller code footprint.

## Architecture

Everything runs in k8s pods using a cloud-provided control plane. Data is stored in AWS S3. Nodes run on high-network-bandwidth EC2 instances.

The following secondary services are required to get everything working:

- RabbitMQ: Facilitates communication between nodes. Nodes don't talk to each other; they pass messages through RMQ feeds.
- PostgreSQL: The main job store. A job is a row in a table. State is serialized through SQL.
- Consul: Provides observability into the swarm. Since no nodes talk to each other, we need a way to know who's where, what they're doing, and when they joined.
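For poking at the system locally rather than in k8s, the three services can be stood up with Docker. This is purely a convenience sketch — the image tags and names are assumptions, not what production runs:

```shell
# Local stand-ins for the three secondary services (tags are illustrative).
docker run -d --name squall-rmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management
docker run -d --name squall-pg -p 5432:5432 -e POSTGRES_PASSWORD=squall postgres:16
docker run -d --name squall-consul -p 8500:8500 hashicorp/consul agent -dev -client=0.0.0.0
```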