SolrCloud Pod moved to new Node - Replica Migration pending #668
Comments
Solr Operator logs
I changed the configs and eliminated the dependency on ZK being on the same node as the SolrCloud pod. Looking at the other pods, the contents of the corresponding folder are written with "#Written by CorePropertiesLocator". So the question is -
bump...
Newer versions of Solr do not have an AutoAddReplica feature. What kind of persistent volumes are you using? The data should not be missing when the pod is restarted. That's a failure of Kubernetes/your PVC, and the Solr Operator isn't built to handle that. When you are running with persistent data, it will expect the data to be there when restarted. If you are running with ephemeral data, it will remove the data from the node before killing the pod. It can get into a bad state if the pod is killed on its own and the data isn't moved beforehand.
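For reference, that persistent-vs-ephemeral distinction maps to the `dataStorage` section of the SolrCloud spec. A minimal sketch, assuming the `v1beta1` CRD fields used by Solr Operator 0.8 (storage class name and sizes are illustrative; verify field names against your operator version):

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: solrcloud
spec:
  replicas: 3
  # Persistent data: the operator expects the data to still be on the PV
  # when the pod comes back.
  dataStorage:
    persistent:
      reclaimPolicy: Retain
      pvcTemplate:
        spec:
          storageClassName: local-storage   # illustrative name
          resources:
            requests:
              storage: 20Gi
  # Ephemeral alternative: the operator moves replicas off a pod before it
  # is deleted, since the data is lost with the pod.
  # dataStorage:
  #   ephemeral: {}
```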
I have tested with both persistent and ephemeral storage, and both result in loss of data for that Solr node. For persistent storage I am using the local volume provisioner. As long as the pod comes back on the same EKS node, the PVC binds to the same PV and the data is retained. But when the pod gets scheduled onto another EKS node (which is my scenario), the data is lost: the directories for the core config on the replaced pod are empty.
The only way that local volumes work as PVs is if the PVs that are created have node limitations (i.e. the pod connected to the PV cannot be rescheduled onto another node). Are you sure that the local volume provisioner is set up correctly?
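To illustrate the node limitation: a statically provisioned local PV carries a required `nodeAffinity` term, so any pod bound to it through a PVC can only ever be scheduled on that one node. A sketch with illustrative names:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # local volumes are not dynamically provisioned
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: solr-data-node-a                    # illustrative name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/solr                   # illustrative path on the node
  nodeAffinity:                             # this pins the PV (and its pod) to one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ip-10-0-1-23.ec2.internal # illustrative EKS node name
```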
Yes, the PVs are set up correctly. Solr pods come up correctly after eviction or restart. The specific scenario I'm certifying is an EKS node taint and drain (replacing a node with new hardware). So it appears from your description that the operator will NOT move the data, since the local volume is tied to an EKS node. Is there a recommendation for manually triggering data replication from the other 2 nodes, e.g. an API call to repopulate the data under $SOLR_HOME/data/<<collection_shardN_replica_nN>>?
Ahhh yes, during node draining. That is a problem. Yes, that is correct. What I would do is issue a REPLACENODE command, moving all of the replicas off of the data-less pod. Then you can do a balance after that to move replicas back onto that pod. It would be nice to have a command to fix all of the data on broken replicas; maybe I'll make a JIRA for that. One thing the operator could do is notice that a PV has changed (i.e. the data might be gone), and if so automate the replica moves to restore the data. Can you confirm that the PVs tied to the Solr PVCs change after draining the node? If so we can watch those PVs and try to fix the data when they change.
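A sketch of that sequence using the Collections API. The node name and request id are illustrative (use the node name exactly as it appears in CLUSTERSTATUS), and the balance call shown is the v2 balance-replicas endpoint added in Solr 9.3, so verify the exact path against your Solr version's reference guide:

```bash
# 1. Move every replica off the data-less node (run against any live Solr pod).
curl "http://localhost:8983/solr/admin/collections?action=REPLACENODE&sourceNode=solrcloud-0.solrcloud-headless.default:8983_solr&async=replace-node-1"

# 2. Poll until the async request finishes.
curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=replace-node-1"

# 3. Spread replicas back across all live nodes (Solr 9.3+; verify endpoint).
curl -X POST "http://localhost:8983/api/cluster/replicas/balance" \
  -H "Content-Type: application/json" \
  -d '{"waitForFinalState": true}'
```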
Environment:
Solr Operator Helm : 0.8.0
Solr 9.4 container image
3 node cluster
Persistent storage option (w/ local volume provisioner)
Managed upgrade strategy.
The K8s node for solrcloud-0 got cordoned and the pod was moved to a new node.
When the pod came up on the new node, it was recognized as part of the SolrCloud StatefulSet, but at the collection level the replica on that node was lost. Looking at the StatefulSet, there's a cluster lock.
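The symptom can be confirmed after the drain with the commands below (pod, namespace defaults, and collection name are illustrative):

```bash
# Confirm the pod landed on a different node after the drain.
kubectl get pod solrcloud-0 -o wide

# Check collection state: the replica that lived on the drained node shows as "down".
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection"
```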
Please allow me to ask if I'm missing a step in the process...
Should the operator automatically do the replica migration?
I have read about the Rebalance API and am using version 9.4.
Is there a way to manually kick off the replica migration step for that specific pod?
SolrCloud custom resource definition.