Anomaly #4601

closed

Difficult restart of the coon VMs

Added by Christian P. Momon over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: High
Assignee: Christian P. Momon
Category: -
Start date: 07/15/2020
Due date:
% Done: 0%
Estimated time:

Description

Following a routine security update, coon was very difficult to restart.

Here are some traces:
<pre>
=(^-^)=root@coon:/etc/libvirt/qemu# for host in $(ls *xml | sed -e 's/.xml//g'| grep -v modele) ; do virsh start $host ; done
error: Failed to start domain admin
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain allo
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain bastion
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain bla
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain dns
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain drop
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain lamp
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain libreoffice
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain ludo
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain mail
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain pad
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain pouet
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain sympa
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain valise
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain xmpp
error: Requested operation is not valid: network 'default' is not active

=================================================================================================

In virt-manager > coon > virtual networks > default > State, the network is shown as "inactive".
Trying to activate it gives:
Erreur lors du démarrage du réseau « default »: internal error: Network is already in use by interface virbr0
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/network.py", line 76, in start
    self._backend.create()
  File "/usr/lib/python3/dist-packages/libvirt.py", line 2996, in create
    if ret == -1: raise libvirtError ('virNetworkCreate() failed', net=self)
libvirt.libvirtError: internal error: Network is already in use by interface virbr0

=================================================================================================

juil. 15 00:57:44 coon.chapril.org icinga2[1305]: [2020-07-15 00:57:44 +0200] information/ApiListener: Finished reconnecting to endpoint 'admin.cluster.chapril.org' via host 'admin.cluster.chapril.org' and port '5665'
juil. 15 00:57:44 coon.chapril.org systemd[1]: Reloading Postfix Mail Transport Agent.
juil. 15 00:57:44 coon.chapril.org systemd[1]: Reloaded Postfix Mail Transport Agent.
juil. 15 00:57:44 coon.chapril.org kernel: e1000e: enp1s0 NIC Link is Down
juil. 15 00:57:44 coon.chapril.org kernel: virbr0: port 2(enp1s0) entered disabled state
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: libvirt version: 5.0.0, package: 4+deb10u1 (Guido Günther <agx@sigxcpu.org> Thu, 05 Dec 2019 00:22:14 +0100)
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: hostname: coon.chapril.org
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: Network name='default' uuid=0f6e21a4-a6e3-45fe-af5b-3af5361ec327 is tainted: hook-script
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: internal error: Network is already in use by interface virbr0
juil. 15 00:57:45 coon.chapril.org kernel: drop UNMATCHED IN-external_trIN=enp0s31f6 OUT= MAC=90:1b:0e:cb:cd:12:40:71:83:a5:f1:d0:08:00 SRC=185.39.11.32 DST=94.130.8.3 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=60789 PROTO=TCP SPT=41728 DPT=622 WINDOW=1024 RES=0x00 SYN URGP=0 
juil. 15 00:57:52 coon.chapril.org sshd[2762]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=222.186.173.226  user=root
juil. 15 00:57:52 coon.chapril.org systemd[1]: systemd-fsckd.service: Succeeded.
juil. 15 00:57:52 coon.chapril.org kernel: drbd coon: bind before connect failed, err = -99
juil. 15 00:57:52 coon.chapril.org kernel: drbd coon: conn( WFConnection -> Disconnecting ) 
juil. 15 00:57:53 coon.chapril.org kernel: drop UNMATCHED IN-external_trIN=enp0s31f6 OUT= MAC=90:1b:0e:cb:cd:12:40:71:83:a5:f1:d0:08:00 SRC=93.174.93.123 DST=94.130.8.3 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=57905 PROTO=TCP SPT=43411 DPT=11395 WINDOW=1024 RES=0x00 SYN URGP=0 
juil. 15 00:57:54 coon.chapril.org sshd[2762]: Failed password for root from 222.186.173.226 port 62800 ssh2
juil. 15 00:57:54 coon.chapril.org icinga2[1305]: [2020-07-15 00:57:54 +0200] information/ApiListener: Reconnecting to endpoint 'admin.cluster.chapril.org' via host 'admin.cluster.chapril.org' and port '5665'
juil. 15 00:57:54 coon.chapril.org FireHOL[2965]: FireHOL started from '/' with: /usr/sbin/firehol start
juil. 15 00:57:54 coon.chapril.org FireHOL[2967]: Saving active firewall to a temporary file started
juil. 15 00:57:54 coon.chapril.org delayed_fw_reload[2714]: FireHOL: Saving active firewall to a temporary file...  OK
juil. 15 00:57:54 coon.chapril.org FireHOL[2979]: Saving active firewall to a temporary file succeeded
juil. 15 00:57:54 coon.chapril.org FireHOL[2980]: Processing file '/etc/firehol/firehol.conf' started
juil. 15 00:57:55 coon.chapril.org drbd[1324]: WARN: stdin/stdout is not a TTY; using /dev/console
juil. 15 00:57:55 coon.chapril.org drbd[1324]: .
juil. 15 00:57:55 coon.chapril.org systemd[1]: Started LSB: Control DRBD resources..
juil. 15 00:57:55 coon.chapril.org kernel: drbd maine: bind before connect failed, err = -99
juil. 15 00:57:55 coon.chapril.org kernel: drbd maine: conn( WFConnection -> Disconnecting )
</pre>
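
The start loop in the traces fails for every domain because the libvirt network 'default' is not active. A minimal sketch of a guard that checks the network state before starting any domain, so the failure is reported once instead of once per VM. It assumes the usual `virsh net-info` output with an `Active:` field. Both function names are hypothetical. It does not fix the "already in use by interface virbr0" error, it only fails fast.

```shell
#!/bin/bash

# Hypothetical helper: reads `virsh net-info <network>` output on stdin and
# succeeds only if it contains an "Active: yes" line.
net_is_active() {
    awk -F': *' '$1 == "Active" { active = ($2 == "yes") } END { exit !active }'
}

# Hypothetical helper: start every domain named after the network argument,
# but only when that libvirt network is reported active.
start_domains_if_net_active() {
    local network="$1"; shift
    if ! virsh net-info "$network" | net_is_active; then
        echo "network '$network' is not active, not starting any domain" >&2
        return 1
    fi
    local domain
    for domain in "$@"; do
        virsh start "$domain"
    done
}

# Example (not executed here):
#   start_domains_if_net_active default admin dns mail
```

Keeping the parsing in its own function means it can be checked against sample `net-info` output without touching libvirt.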


Related issues 2 (0 open, 2 closed)

Related to Infra Chapril - Anomaly #3603: Difficult restart of the coon VMs (Closed, Christian P. Momon, 02/19/2019)

Related to Infra Chapril - Request #4599: Opening ports for ludo (Closed, François Poulain, 07/14/2020)

Actions #1

Updated by Christian P. Momon over 4 years ago

  • Description updated (diff)
Actions #2

Updated by Christian P. Momon over 4 years ago

  • Status changed from New to In progress
  • Priority changed from Normal to High

To unblock the situation, we had to revert the minetest configuration changes in firehol.conf, even though that file contains absolutely nothing that looks like a potential source of the problem.

Actions #3

Updated by Christian P. Momon over 4 years ago

PoluX's comment on #4599:

After a long investigation with Christian, this line caught our attention:
Jul 15 00:01:20 coon delayed_fw_reload[1616]: ERROR: FireHOL is already running. Exiting...
We suspect a race condition that shows up now that the iptables rules have become numerous (4000).
Looking at the systemd configuration of libvirtd on coon, we find:
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─override.conf
[...]
  Process: 1441 ExecStartPost=/usr/local/bin/delayed_fw_reload (code=exited, status=0/SUCCESS)
This looks like a forgotten kludge dating from the early days of the cluster. Since then, the network hooks have made it unnecessary.
Besides, the code involved is more than naive.
# cat /usr/local/bin/delayed_fw_reload
#!/bin/bash

sleep 10 && firehol start

I am removing all of this from coon. Maine does not have it.
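
The culprit here was an `ExecStartPost=` line in a forgotten drop-in. A minimal sketch of how such leftovers could be spotted on another host, assuming drop-ins live in a `<unit>.service.d` directory as `*.conf` files. The function name is hypothetical, and this is not the procedure that was actually used on coon.

```shell
#!/bin/bash

# Hypothetical helper: print the drop-in files in a systemd drop-in directory
# that define an ExecStartPost= command. Prints nothing if there are none.
list_exec_start_post_dropins() {
    local dropin_dir="$1"
    local conf
    for conf in "$dropin_dir"/*.conf; do
        # Skip the unexpanded glob when the directory has no .conf file.
        [ -f "$conf" ] || continue
        if grep -q -E '^[[:space:]]*ExecStartPost=' "$conf"; then
            printf '%s\n' "$conf"
        fi
    done
    return 0
}

# Example (not executed here):
#   list_exec_start_post_dropins /etc/systemd/system/libvirtd.service.d
```

On a live system, `systemctl cat libvirtd.service` shows the unit together with its drop-ins and can be used to confirm what the scan reports.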
Actions #4

Updated by Christian P. Momon over 4 years ago

The restart test with the firehol changes reintegrated gives a nominal result. So we confirm that this was the cause \o/

Actions #5

Updated by Christian P. Momon over 4 years ago

  • Status changed from In progress to Resolved

PoluX points out that the problem had existed for a while: the same symptoms appear in tickets #3603 and #3734 and in the email of 08/10/2019…

So, great that we found and fixed it!

Actions #6

Updated by Christian P. Momon over 4 years ago

  • Related to Anomaly #3603: Difficult restart of the coon VMs added
Actions #7

Updated by Christian P. Momon over 4 years ago

Actions #8

Updated by Christian P. Momon over 4 years ago

  • Status changed from Resolved to Closed
Actions #9

Updated by Christian P. Momon over 4 years ago

  • Target version changed from Backlog to Sprint 2020 été