The result of this one-char bug was pretty serious, if the new master
had the same port of the previous master, but just a different IP
address, non-leader Sentinels would not be able to recognize the
configuration change.
This commit fixes issue #1394.
Many thanks to @shanemadden that reported the bug and helped
investigating it.
At the end of the file, CONFIG REWRITE adds a comment line that:
# Generated by CONFIG REWRITE
Followed by the additional config options required. However this was
added again and again at every rewrite in praticular conditions (when a
given set of options change in a given time during the time).
Now if it was alrady encountered, it is not added a second time.
This is especially important for Sentinel that rewrites the config at
every state change.
Some are just to know if the master is down, and in this case the runid
in the request is set to "*", others are actually in order to seek for a
vote and get elected. In the latter case the runid is set to the runid
of the instance seeking for the vote.
Also the sentinel configuration rewriting was modified in order to
account for failover in progress, where we need to provide the promoted
slave address as master address, and the old master address as one of
the slaves address.
We'll use CONFIG REWRITE (internally) in order to store the new
configuration of a Sentinel after the internal state changes. In order
to do so, we need configuration options (that usually the user will not
touch at all) about config epoch of the master, Sentinels and Slaves
known for this master, and so forth.
The time Sentinel waits since the slave is detected to be configured to
the wrong master, before reconfiguring it, is now the failover_timeout
time as this makes more sense in order to give the Sentinel performing
the failover enoung time to reconfigure the slaves slowly (if required
by the configuration).
Also we now PUBLISH more frequently the new configuraiton as this allows
to switch the reapprearing master back to slave faster.
Also defaulf failover timeout changed to 3 minutes as the failover is a
fairly fast procedure most of the times, unless there are a very big
number of slaves and the user picked to configure them sequentially (in
that case the user should change the failover timeout accordingly).
Once we switched configuration during a failover, we should advertise
the new address.
This was a serious race condition as the Sentinel performing the
failover for a moment advertised the old address with the new
configuration epoch: once trasmitted to the other Sentinels the broken
configuration would remain there forever, until the next failover
(because a greater configuration epoch is required to overwrite an older
one).
Now Sentinel believe the current configuration is always the winner and
should be applied by Sentinels instead of trying to adapt our view of
the cluster based on what we observe.
So the only way to modify what a Sentinel believe to be the truth is to
win an election and advertise the new configuration via Pub / Sub with a
greater configuration epoch.