Hi all,
We have a vApp in vCloud whose VMs had been working flawlessly since day one. But after we updated our vSphere environment to version 6.0, the VMs started rebooting themselves out of the blue and wouldn't stay up for more than about five days at most. We then disabled VM monitoring for those VMs in the cluster configuration, so HA no longer reboots them, but now we are seeing all kinds of kernel call traces, for example:
Jul 5 06:19:35 compu kernel: hrtimer: interrupt took 42859741512 ns
Jul 5 06:19:36 compu kernel: INFO: rcu_sched self-detected stall on CPU
Jul 5 06:19:36 compu kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 0, t=69334 jiffies, g=9237069, c=9237068, q=5)
Jul 5 06:19:36 compu kernel: Task dump for CPU 3:
Jul 5 06:19:36 compu kernel: kworker/3:2 R running task 0 22078 2 0x00000088
Jul 5 06:19:36 compu kernel: sched: RT throttling activated
Jul 5 06:19:36 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]
Jul 5 06:19:36 compu kernel: { 3} (t=73140 jiffies g=9237069 c=9237068 q=5)
Jul 5 06:19:36 compu kernel: 0000000000000000 0000000000000200 000000003e25cd65 ffffffff81f0a540
Jul 5 06:19:36 compu kernel: 0000000000000202 0000000000000000 0000000000000000 ffff880716cd0000
Jul 5 06:19:36 compu kernel: ffff88041e9d3dc0 ffffffff811cf94a ffff880716cd0000 0000000000000000
Jul 5 06:19:36 compu kernel: Call Trace:
Jul 5 06:19:36 compu kernel: [<ffffffff811cf94a>] ? alloc_pages_current+0xaa/0x170
Jul 5 06:19:36 compu kernel: [<ffffffffa03b5ff8>] ? vmballoon_work+0x3b8/0x618 [vmw_balloon]
Jul 5 06:19:36 compu kernel: [<ffffffff810a845b>] ? process_one_work+0x17b/0x470
Jul 5 06:19:36 compu kernel: [<ffffffff810a9296>] ? worker_thread+0x126/0x410
Jul 5 06:19:36 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
Jul 5 06:19:36 compu kernel: [<ffffffff810b0a4f>] ? kthread+0xcf/0xe0
Jul 5 06:19:36 compu kernel: [<ffffffff810bfde3>] ? finish_task_switch+0x53/0x180
Jul 5 06:19:36 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jul 5 06:19:36 compu kernel: [<ffffffff81697698>] ? ret_from_fork+0x58/0x90
Jul 5 06:19:36 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jul 5 06:19:36 compu kernel: Task dump for CPU 3:
Jul 5 06:19:36 compu kernel: kworker/3:2 R running task 0 22078 2 0x00000088
Jul 5 06:19:36 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]
Jul 5 06:19:36 compu kernel: ffff880716cd0000 000000003e25cd65 ffff880c3fd83c48 ffffffff810c47e8
Jul 5 06:19:36 compu kernel: 0000000000000003 ffffffff81a1e6c0 ffff880c3fd83c60 ffffffff810c80c9
Jul 5 06:19:36 compu kernel: 0000000000000004 ffff880c3fd83c90 ffffffff81137960 ffff880c3fd901c0
Jul 5 06:19:36 compu kernel: Call Trace:
Jul 5 06:19:36 compu kernel: <IRQ> [<ffffffff810c47e8>] sched_show_task+0xa8/0x110
Jul 5 06:19:36 compu kernel: [<ffffffff810c80c9>] dump_cpu_task+0x39/0x70
Jul 5 06:19:36 compu kernel: [<ffffffff81137960>] rcu_dump_cpu_stacks+0x90/0xd0
Jul 5 06:19:36 compu kernel: [<ffffffff8113b0b2>] rcu_check_callbacks+0x442/0x720
Jul 5 06:19:36 compu kernel: [<ffffffff810f3610>] ? tick_sched_handle.isra.13+0x60/0x60
Jul 5 06:19:36 compu kernel: [<ffffffff81099697>] update_process_times+0x47/0x80
Jul 5 06:19:36 compu kernel: [<ffffffff810f35d5>] tick_sched_handle.isra.13+0x25/0x60
Jul 5 06:19:36 compu kernel: [<ffffffff810f3651>] tick_sched_timer+0x41/0x70
Jul 5 06:19:36 compu kernel: [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
Jul 5 06:19:36 compu kernel: [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
Jul 5 06:19:36 compu kernel: [<ffffffff81050ff7>] local_apic_timer_interrupt+0x37/0x60
Jul 5 06:19:36 compu kernel: [<ffffffff81699e4f>] smp_apic_timer_interrupt+0x3f/0x60
Jul 5 06:19:36 compu kernel: [<ffffffff8169839d>] apic_timer_interrupt+0x6d/0x80
Jul 5 06:19:36 compu kernel: [<ffffffff8108f5e8>] ? __do_softirq+0x98/0x280
Jul 5 06:19:36 compu kernel: [<ffffffff810f12fb>] ? clockevents_program_event+0x6b/0xf0
Jul 5 06:19:36 compu kernel: [<ffffffff816991dc>] call_softirq+0x1c/0x30
Jul 5 06:19:36 compu kernel: [<ffffffff8102d365>] do_softirq+0x65/0xa0
Jul 5 06:19:36 compu kernel: [<ffffffff8108f9d5>] irq_exit+0x115/0x120
Jul 5 06:19:36 compu kernel: [<ffffffff81699e55>] smp_apic_timer_interrupt+0x45/0x60
Jul 5 06:19:36 compu kernel: [<ffffffff8169839d>] apic_timer_interrupt+0x6d/0x80
Jul 5 06:19:36 compu kernel: <EOI> [<ffffffff8118ac84>] ? get_page_from_freelist+0x434/0x9f0
Jul 5 06:19:36 compu kernel: [<ffffffff8118b3b6>] __alloc_pages_nodemask+0x176/0x420
Jul 5 06:19:36 compu kernel: [<ffffffff811cf94a>] alloc_pages_current+0xaa/0x170
Jul 5 06:19:36 compu kernel: [<ffffffffa03b5ff8>] vmballoon_work+0x3b8/0x618 [vmw_balloon]
Jul 5 06:19:36 compu kernel: [<ffffffff810a845b>] process_one_work+0x17b/0x470
Jul 5 06:19:36 compu kernel: [<ffffffff810a9296>] worker_thread+0x126/0x410
Jul 5 06:19:36 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
Jul 5 06:19:36 compu kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jul 5 06:19:37 compu kernel: [<ffffffff810bfde3>] ? finish_task_switch+0x53/0x180
Jul 5 06:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jul 5 06:19:37 compu kernel: [<ffffffff81697698>] ret_from_fork+0x58/0x90
Jul 5 06:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jul 6 02:19:37 compu kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 1, t=75951 jiffies, g=11716833, c=11716832, q=7)
Jul 6 02:19:37 compu kernel: Task dump for CPU 3:
Jul 6 02:19:37 compu kernel: kworker/3:0 R running task 0 3126 2 0x00000088
Jul 6 02:19:37 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]
Jul 6 02:19:37 compu kernel: ffff880291e33d98 0000000000000046 ffff880700474e70 ffff880291e33fd8
Jul 6 02:19:37 compu kernel: ffff880291e33fd8 ffff880291e33fd8 ffff880700474e70 ffff880291e30000
Jul 6 02:19:37 compu kernel: 00000000000000ab 00000000000000ab 0000000000000000 ffffffffa03b8558
Jul 6 02:19:37 compu kernel: Call Trace:
Jul 6 02:19:37 compu kernel: [<ffffffff810c2076>] __cond_resched+0x26/0x30
Jul 6 02:19:37 compu kernel: [<ffffffff8168c9ea>] _cond_resched+0x3a/0x50
Jul 6 02:19:37 compu kernel: [<ffffffffa03b6029>] vmballoon_work+0x3e9/0x618 [vmw_balloon]
Jul 6 02:19:37 compu kernel: [<ffffffff810a845b>] process_one_work+0x17b/0x470
Jul 6 02:19:37 compu kernel: [<ffffffff810a9296>] worker_thread+0x126/0x410
Jul 6 02:19:37 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
Jul 6 02:19:37 compu kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jul 6 02:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jul 6 02:19:37 compu kernel: [<ffffffff81697698>] ret_from_fork+0x58/0x90
Jul 6 02:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
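In case it helps anyone comparing notes, this is roughly how we check balloon driver activity inside the affected guests when these traces show up (assuming open-vm-tools or VMware Tools is installed, since vmware-toolbox-cmd comes from the tools package):

# Is the balloon driver loaded, and how much memory does it currently hold?
lsmod | grep vmw_balloon
vmware-toolbox-cmd stat balloon
# Recent balloon/RCU messages in the kernel ring buffer
dmesg | grep -iE 'balloon|rcu_sched' | tail -n 50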
I opened a case with VMware support looking for help with this issue, but all I got was a "go check the OS or the applications" response. To be honest, the issues started the day after we updated to vSphere 6.0. We had updated only half of the nodes in the cluster, so the other half was still running 5.5, and we split the VMs 50/50 between the 6.0 and 5.5 hosts; the VMs hosted on the 5.5 nodes never showed this behavior. Since we eventually had to complete the update on the remaining nodes, they are all now running 6.0 and the issue is far from gone.
The issue affects only VMs within this vApp. The vCloud account has more than enough resources to cover the reservation pool for those VMs, and the problem happens only on VMs running CentOS 7 with kernel 3.10 or 3.11.
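For anyone comparing environments, the running kernel and the in-tree balloon module build can be pulled from each guest with:

uname -r
modinfo vmw_balloon | grep -iE '^(version|vermagic|srcversion)'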
Now, I read in another post elsewhere that it might be an incompatibility between the hypervisor and that kernel version, so the Linux flavor running on those VMs should be the least of our concerns. Before trying to install a 4.x kernel from unofficial repositories, I was wondering whether anyone has run into the same situation and would be willing to share their approach, whether they actually troubleshot it or are still facing the same problem, so we can compare notes and hopefully nail this down.
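For reference, if we do go the 4.x route, the plan would be roughly this, using the ELRepo mainline kernel on CentOS 7 (just a sketch; check elrepo.org for the current release RPM before running any of it):

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
# Release RPM URL per elrepo.org; the exact package name may have changed
yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
# Pull the mainline (4.x) kernel from the elrepo-kernel repository
yum --enablerepo=elrepo-kernel install kernel-ml
# Make the new kernel the default boot entry
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg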
Thanks in advance.