Anything that can go wrong will go wrong (Distributed Systems)

This is not something any software engineer wants to acknowledge. In the early days of my own startup, I was optimistic about everything I was working on or building upon, whether it was a piece of code, some infra, or a design. Then the one scenario you thought was highly unlikely occurs, and you aren't prepared to handle it. This becomes more true as you scale and more variables enter the system. So anything in your system or code that you think will work 99% of the time will fail one day, and you should be prepared to handle that, unlike me :p

Now if we move into distributed systems, there are hundreds of such variables where things could go wrong. By now I have seen these failures happen at scale, but this time we were able to handle them. So never assume anything will always work. Even if the probability of failure is less than 1%, it can still happen. Think about this at big companies. Suppose any given server goes down about once a year on average. With 300-400 such servers, you should expect roughly one server to go down every day. With tens of thousands of servers, you should expect at least one to go down every hour, and we need to handle all of that.
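To make the server arithmetic above concrete, here is a small back-of-the-envelope sketch. It assumes failures are independent and that each server fails about once a year on average, which is a simplification for illustration and not a measured number. The function names are made up for this post.

```python
# Back-of-the-envelope failure math. Assumes independent failures and a
# flat average rate per server; real fleets are messier than this.

DAYS_PER_YEAR = 365
HOURS_PER_YEAR = DAYS_PER_YEAR * 24  # 8760


def expected_failures_per_day(servers: int, failures_per_server_per_year: float = 1.0) -> float:
    """Average number of server failures per day across the fleet."""
    return servers * failures_per_server_per_year / DAYS_PER_YEAR


def expected_failures_per_hour(servers: int, failures_per_server_per_year: float = 1.0) -> float:
    """Average number of server failures per hour across the fleet."""
    return servers * failures_per_server_per_year / HOURS_PER_YEAR


def prob_at_least_one_failure(success_rate: float, attempts: int) -> float:
    """Chance that something which works `success_rate` of the time
    fails at least once over `attempts` independent tries."""
    return 1.0 - success_rate ** attempts


if __name__ == "__main__":
    # 365 servers, each failing once a year: one failure a day on average.
    print(f"365 servers:    {expected_failures_per_day(365):.2f} failures/day")
    # 10,000 servers: a bit more than one failure every hour on average.
    print(f"10,000 servers: {expected_failures_per_hour(10_000):.2f} failures/hour")
    # Code that works 99% of the time, called 100 times: about a 63% chance
    # of seeing at least one failure.
    print(f"99% x 100 calls: {prob_at_least_one_failure(0.99, 100):.0%} chance of a failure")
```

The last function is the same idea applied to code: a "99% reliable" call is close to guaranteed to fail once you make enough of them.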

So be pessimistic while designing any system, writing any piece of code, or managing any infra. Question every aspect of it and assume a doomsday scenario as you go. This will eventually help you come up with much more reliable and resilient software and systems :)

I learnt this the hard way, as many of us do. I hope this sheds some light on such nuances.