Satellite operations at Capella is an exciting job focused on making sure our satellites and surrounding systems are running at their best to provide our customers with the fastest and highest-quality data possible. We work at the intersection of software engineering, satellite systems engineering, and DevOps, giving us a unique perspective on the entire production stack, from order submission through the ground software, ground stations, and satellites themselves, all the way to data processing and delivery.
The core of our operation is automation. We operate six satellites taking dozens of collects and contacts every day, 24/7/365, which would require a tremendous amount of human effort if each contact and collect were managed manually. We've developed a powerful software system, with components running in both the space and ground segments, to provide fully hands-off, lights-out constellation operation in all nominal and many off-nominal cases. This allows us to maximize our uptime and system productivity while minimizing expensive and error-prone human-in-the-loop operations.
But it's not all mai tais on the beach while the system flies itself! Today I'm going to cover a few of the things that a satellite operations engineer does on a daily basis to keep our system running smoothly.
Being On-Call
Each week, one of us is on-call. We rotate among the six of us so that nobody gets too overloaded and we all get a chance to take time off. The on-call engineer gets a phone call if our automated monitoring system detects a fault that it cannot resolve on its own. We can be called for a number of reasons, ranging from a small but important inaccuracy in a satellite's orbit solution to on-board faults that cause a satellite to put itself into a safe configuration while waiting for further instructions. After the phone rings, we get online to assess the situation. Typically the issue is pretty straightforward and we have a pre-written procedure to run through that has been used before to resolve the same problem.
Occasionally we experience a novel fault and that's when the fun really begins! We assess satellite telemetry, review the software and hardware associated with the misbehaving parts, and work out a plan of action. The actions usually involve restarting parts of the system, changing configuration for various onboard processes, and writing software patches. In the event that we're stumped and can't figure out how to proceed on our own, we have a fantastic support team we can call on to bring subsystem expertise and get to the root of the problem. Most issues are fully resolved within a couple of hours of their onset.
Whoever is on-call also gets to run our weekly Reliability meeting. In this meeting, we review any instances of downtime or failures to deliver a collect to a customer's satisfaction. Each line item is root-caused and added to a database of known issues. We look at recent and historical error rates for various bugs and faults, then tackle the top few most impactful error causes each week. This process ensures that we identify new issues as they occur while also focusing our effort on the bugs having the biggest impact on customer satisfaction. We've been running this process for a couple of years now and have seen a huge improvement in system performance and the end results for our customers.
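To make the "tackle the top few error causes" step concrete, here is a minimal Python sketch. It is illustrative only and not our production tooling: the incident records, the `root_cause` field, the cause names, and the `top_error_causes` function are all hypothetical and invented for this example.

```python
from collections import Counter

def top_error_causes(incidents, n=3):
    """Return the n most frequent root causes as (cause, count, share) tuples."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    total = sum(counts.values())
    # most_common() sorts by count, highest first
    return [(cause, count, count / total) for cause, count in counts.most_common(n)]

# Hypothetical known-issues records: one entry per reviewed line item.
incidents = [
    {"id": 1, "root_cause": "stale_orbit_solution"},
    {"id": 2, "root_cause": "downlink_dropout"},
    {"id": 3, "root_cause": "stale_orbit_solution"},
    {"id": 4, "root_cause": "tasking_conflict"},
    {"id": 5, "root_cause": "downlink_dropout"},
    {"id": 6, "root_cause": "stale_orbit_solution"},
]

for cause, count, share in top_error_causes(incidents, n=2):
    print(f"{cause}: {count} incidents ({share:.0%} of total)")
```

Ranking by raw count is the simplest possible choice. A real review would more likely weight each cause by its customer impact, for example by downtime or missed collects.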
Development Work
When we're not on-call (or if it's a quiet week in space), we get to work on projects that improve the long-term performance and reliability of our operation.
Our focus is usually automation: we never want to do anything by hand more than we have to. Satellite Operations Engineers will review recent anomalies that required an operator to fix the issue and write software that duplicates the decisions and actions the operator took. The next time the same fault occurs, that software will run and fix the problem autonomously instead of having to call the on-call engineer. This also improves the speed of resolution since software can assess telemetry and make decisions more quickly than a human.
Other automation work involves our launch and commissioning process. When a new satellite first arrives on-orbit, we have to perform a number of setup, configuration, and calibration steps to ready the satellite for normal operations. This process used to be fully manual, requiring over a week of time from multiple engineers across multiple shifts. Now, we can commission a satellite in 2-3 days with fewer than half of the operators that used to be involved. Our ultimate goal is a completely hands-off commissioning, so the team works on software and data analysis towards that goal when we're between launch campaigns.
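As a rough illustration of the anomaly automation described above, here is a simplified Python sketch of the pattern: known faults are mapped to automated responses, and anything unknown or unresolved still pages the on-call engineer. Everything in it is hypothetical (the `Fault` class, the fault codes, the `handles` registry, and the `respond` function) and it is not our actual flight or ground software.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Fault:
    satellite: str
    code: str

# Hypothetical registry mapping a fault code to its automated response.
HANDLERS: Dict[str, Callable[[Fault], bool]] = {}

def handles(code: str):
    """Decorator that registers a function as the response for a fault code."""
    def register(func: Callable[[Fault], bool]):
        HANDLERS[code] = func
        return func
    return register

@handles("PAYLOAD_PROCESS_HUNG")
def restart_payload_process(fault: Fault) -> bool:
    # A real handler would command a restart and confirm recovery in telemetry.
    # This placeholder simply reports success.
    return True

def respond(fault: Fault, page_on_call: Callable[[Fault], None]) -> str:
    """Try the automated response first; page a human only if that fails."""
    handler = HANDLERS.get(fault.code)
    if handler is not None and handler(fault):
        return "resolved_autonomously"
    page_on_call(fault)
    return "escalated"

pages = []  # stands in for the phone call to the on-call engineer
print(respond(Fault("sat-1", "PAYLOAD_PROCESS_HUNG"), pages.append))
print(respond(Fault("sat-2", "NEVER_SEEN_BEFORE"), pages.append))
```

Each time an operator resolves a new kind of fault by hand, the equivalent of a new handler can be added, so the set of problems that need a phone call shrinks over time.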
We also support system changes and upgrades. Our satellite R&D team is constantly improving our software and hardware to fix bugs, increase performance, and add new features. In cases where we have software updates to do on existing satellites, we develop and test the upgrade procedure on satellite hardware here in the lab. We determine the steps to execute, the telemetry to confirm, and the reversion procedure if anything goes wrong. We encode all of this into ground and on-board scripts to maximize the safety and speed of the operation. Once we're confident in the software, we perform the upgrade on the satellites in space and make sure the system continues working with the new changes. If we're supporting hardware revisions for future spacecraft, we review our existing procedures and scripts for places where the hardware change is relevant, then work with the design engineers to add, remove, or change commissioning and operations scripts to accommodate it. This also gives us a chance to learn the details of the system so that we can better respond in the event of an anomaly.
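The upgrade procedure described above (steps to execute, telemetry to confirm, and a reversion path if anything goes wrong) can be sketched in a few lines of Python. This is a hypothetical illustration rather than our real scripting framework: the `Step` class, the `run_procedure` function, and the step names are assumptions made for the example, and the telemetry checks are stubbed out with fixed values.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    execute: Callable[[], None]   # the action to take
    confirm: Callable[[], bool]   # telemetry check after the action
    revert: Callable[[], None]    # how to undo the action

def run_procedure(steps: List[Step]) -> bool:
    """Run steps in order; on a failed check, revert completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        step.execute()
        completed.append(step)
        if not step.confirm():
            for done in reversed(completed):
                done.revert()
            return False
    return True

log: List[str] = []  # records what happened, in order

steps = [
    Step("upload_image",
         execute=lambda: log.append("upload"),
         confirm=lambda: True,            # stub: checksum telemetry matches
         revert=lambda: log.append("delete_upload")),
    Step("switch_to_new_image",
         execute=lambda: log.append("switch"),
         confirm=lambda: False,           # stub: new image fails its health check
         revert=lambda: log.append("switch_back")),
]

ok = run_procedure(steps)
print(ok, log)
```

Reverting in reverse order means the most recent change is undone first, which mirrors how a manual back-out is usually performed.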
Closing
In my opinion, being a SatOps Engineer is the best job at Capella! We get to work with nearly every corner of the company, dive into deep technical problems across the satellite and ground system, and no two days are the same. It's a privilege to be at the tip of the spear, jumping on-console at 3am to solve a problem that we've never seen before but that has to be fixed as soon as possible. Every day gets us one step closer to 100% uptime, 100% on-time deliveries, and 0% human-in-the-loop operations, and I'm excited to keep innovating and building towards that goal!