This living document is a catalogue of other documents and explanations of their gaps in logic. The conclusions of these other documents could be valid, but if so that is by luck, not because the author(s) used sound logic. This document will be updated as new ones are discovered.
This document is the output of multiple individuals. Their thoughts and ideas have been collected and combined.
The lesson here is threefold. First, the Big Rewrite is almost a sure-fire way to ensure a project fails. Avoid that temptation. Don’t look into the light. It looks nice, it may even feel nice. Statistically speaking, it’s not nice when you get to the other side of it.
The second lesson is that making something microservices out of the gate is a terrible idea. Microservices architectures are not planned. They are an evolutionary result, not a fully anticipated feature.
Finally, don’t “design for the future”. The future hasn’t happened yet. Nobody knows how it’s going to turn out. The future is going to happen, and you can either adapt to it as it happens in the Now or fail to. Don’t make things overly modular, that leads to insane things like dynamically linking parts of an application over HTTP.
The author describes a system that was rewritten with poor results, then proceeds to generalize lessons from it. But the article doesn't support the conclusions. Is the Big Rewrite a sure-fire way to make a project fail? Maybe, but the article doesn't provide anything beyond a single example. In fact, in the given example the rewrite makes it all the way to accepting requests from customers. Here is how they describe what went wrong:
We set it up, set a trigger for a task, and it worked in testing. After a while of it consistently doing that with the continuous functional testing tooling, we told product it was okay to have a VERY LIMITED set of customers have at it.
That was a mistake. It fell apart the second customers touched it. We struggled to understand why…
And after a week of solid debugging (including making deals with other teams, satan, jesus and the pope to try and understand it), we had made no progress.
But performing better testing is not in the “lessons learned”. Nor is making a system that can be debugged. It seems quite odd that a new system, taking very little traffic, would be so hard to debug. That seems independent of any Big Rewrite.
The second lesson is about developing as microservices. Whether or not this is a good idea likely depends heavily on the surrounding organization. Is there support for developing and operating microservices quickly? Is there institutional knowledge for doing that? If so, perhaps it is a good idea.
The final lesson is contentious in software engineering. Should one plan for the future? If so, how much? The question is common enough that there is a pithy way to express not planning for the future: YAGNI (You Ain't Gonna Need It). But the article doesn't make a compelling argument for this. Again, the article makes it clear that the product was not tested sufficiently and was not debuggable.
At SoftwareMill we’ve been experimenting with different tools for 10 years already. Here are the apps that are currently being used by our teams. Let’s see whether you find something interesting for your project!
Despite the title of the article, how the tools presented translate into faster work is never described. The list of tools seems to be those that the company likes using, but no evidence is presented that they make work faster. It is certainly fine to have a preference for various tools, but that does not mean those tools make one more productive. And a tool does not need to make one more productive in order to have a preference for it.
This article contains a list of Pros and Cons about various programming languages that the author feels qualified to comment on. The end of the post is also a list of languages that the author admits they would never consider using, so their inclusion in a blog post about how to choose a language makes little sense.
Disclaimer: I don’t like the rest of these programming languages and would not use them to solve any problem. If you don’t want your sacred cow gored, leave here.
The “How” in the title of the post implies some sort of process will be described but no such process is given. How does the author take a problem, this list, and then choose a language? A list of Pros and Cons about a tool, even if that list is accurate, is generally insufficient to determine if the tool is right for a job. In the list of languages the author would choose, it is not clear in which case they would use a particular language. Take C#, for example:
Pros: less boilerplate than Java, reasonably healthy package ecosystem, good access to low level tools for interop with C, async/await started here.
Cons: ecosystem is in turmoil because Microsoft cannot hold a singular vision, they became open-source too late and screwed over Mono.
Even if all of these points are accurate: how does one decide to use C# using this information?
It is impossible to determine how the languages compare to each other because there is no consistency between the Pros and Cons of each language. For example:
The author states that the best way to learn is by doing, and in this case that is likely true because this post provides little information the reader can apply.
The opening question of this post is:
In my last post, I talked at length about how our consistency and validation testing puts Redpanda through the paces, including unavailability windows. Fundamentally, we left a big question unanswered: does Raft make a difference?
The following experiments are, presumably, meant to answer that question by comparing Kafka and Redpanda. Both of them have been configured as CP systems.
The first experiment involves performing a kill -9 on the leader, and the results show that both Kafka and Redpanda have a period where the number of successful requests drops while leader election takes place, after which requests continue.
The next experiment compares Kafka and Redpanda when a 10ms latency has been added to the write operation of the leader. This means the consensus protocol is running as normal but any disk writes on the leader are now slower.
We simulate disk latency spikes by adding artificial 10ms latency via FUSE to every disk IO operation on a leader for a minute. It sends Kafka’s average latency to 980ms (p99 is 1256ms) and Redpanda’s to 50ms (p99 is 94ms).
It's not clear what conclusion one is meant to draw. This information lacks any context or interpretation. Is this just an artifact of Kafka doing more writes than Redpanda? Does that have anything to do with the consensus protocol? Does the Redpanda increase in latency make sense? We don't know.
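To make the "more writes" question concrete, here is a back-of-the-envelope sketch using only the numbers quoted above. It assumes, purely for illustration, that the entire reported average latency comes from delayed disk IOs performed one after another on the request path. That assumption is ours, not the post's. Queueing or batching effects could produce the same numbers, and the names in the code are hypothetical.

```python
# Figures taken from the quoted passage.
INJECTED_MS_PER_IO = 10
REPORTED_AVG_MS = {"Kafka": 980, "Redpanda": 50}


def implied_serial_ios(avg_latency_ms, injected_ms=INJECTED_MS_PER_IO):
    """Upper bound on the number of delayed IOs on the critical path.

    ASSUMPTION (ours): all of the reported latency is injected delay,
    and the delayed IOs happen strictly one after another.
    """
    return avg_latency_ms / injected_ms


for system, avg_ms in REPORTED_AVG_MS.items():
    print(f"{system}: at most {implied_serial_ios(avg_ms):.0f} serialized IOs per request")
```

Under that assumption the reported averages would correspond to roughly 98 serialized IOs per request for Kafka and 5 for Redpanda. Whether that reflects write amplification, queueing, or something else is exactly the interpretation the post does not provide.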
Next up is the impact followers have.
Kafka uses sync replication, so any disturbance of a follower affects user end-to-end experience until the disturbance is gone or until a faulty follower is excluded from the ISR (a list of active followers).
Redpanda uses quorum replication (Raft) so as long as a majority of the nodes including a leader is stable it can tolerate any disturbances.
As a result, Raft is less sensitive to the fault injections.
This description of the situation makes little sense. The stated configuration of Kafka is that it is doing quorum replication as well, so what is the distinction actually being described here? And more importantly: what does this have to do with Raft vs Paxos (the consensus protocol that Kafka uses underneath it all)?
But the following experiments confirm that Kafka is doing something that Redpanda is not. When a random follower is terminated, Kafka experiences a jump in timeouts that Redpanda does not. When a random follower has latency injected in its disk IO, Kafka experiences extra latency where Redpanda does not.
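One way to picture the distinction the quoted passage draws is a toy latency model: a leader that waits for every listed follower versus a leader that waits only for a majority of the cluster. The sketch below is an assumption-laden illustration, not a description of how Kafka or Redpanda is implemented. The function names and latency values are hypothetical, and the model ignores networking, batching, and ISR membership changes.

```python
def wait_for_all_ms(leader_ms, follower_ms):
    """Commit latency if the leader waits for itself and every follower."""
    return max([leader_ms, *follower_ms])


def wait_for_majority_ms(leader_ms, follower_ms):
    """Commit latency if the leader waits for a majority, itself included."""
    cluster_size = 1 + len(follower_ms)
    majority = cluster_size // 2 + 1
    followers_needed = majority - 1  # the leader counts toward the majority
    if followers_needed == 0:
        return leader_ms
    # Only the fastest followers needed to reach a majority matter.
    fastest = sorted(follower_ms)[:followers_needed]
    return max(leader_ms, fastest[-1])


# Hypothetical per-node write latencies in milliseconds.
scenarios = {
    "healthy": (2.0, [2.0, 2.0]),
    "slow follower": (2.0, [2.0, 12.0]),
    "slow leader": (12.0, [2.0, 2.0]),
}

for name, (leader, followers) in scenarios.items():
    print(name, wait_for_all_ms(leader, followers), wait_for_majority_ms(leader, followers))
```

In this model a slow follower only hurts the wait-for-all strategy, which is consistent with the follower experiments. A slow leader hurts both strategies equally, so the model offers no explanation for the difference seen when latency was injected into the leader.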
The post ends with a quote from a paper by Heidi Howard and Richard Mortier:
We must first answer the question of how exactly the two algorithms differ in their approach to consensus? Not only will this help in evaluating these algorithms, it may also allow Raft to benefit from the decades of research optimising Paxos’ performance and vice versa.
By describing a simplified Paxos algorithm using the same approach as Raft, we find that the two algorithms differ only in their approach to leader election
The last sentence is important: they only differ on leader election. The author(s) of the blog post put emphasis on the importance of having a mental model of a system and testing it. But if the only difference between Raft and Paxos is leader election, then why did we see differences when the leaders had latency injected into them? That does not cause a new leader to be elected. Why did we see different performance numbers when a follower had latency injected into it? Or when a follower was terminated?
The conclusion of the author(s) seems to be that Raft makes a difference, but their experiments don't provide evidence as to why. Their academic references would even seem to contradict their own conclusion.