[Cialug] Spontaneous Outbreak of Read-Only-ness

Wed Feb 24 14:47:14 CST 2016

Solved!

Yes, these were VMs. I started to suspect the SAN. I found an article where
someone had changed kernel parameters to flush the cache to disk more often,
and that fixed it for them. But eventually I found that /var/log/messages
had several messages like this:

Feb 20 22:54:07 hostname kernel: INFO: task sadc:39225 blocked for more than 120 seconds.
Feb 20 22:54:07 hostname kernel:      Not tainted 2.6.32-573.18.1.el6.x86_64 #1
Feb 20 22:54:07 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 20 22:54:07 hostname kernel: sadc          D 0000000000000005     0 39225  39222 0x00000084
Feb 20 22:54:07 hostname kernel: ffff880335123cc8 0000000000000086 0000000000000000 ffff880332c5bb00
Feb 20 22:54:07 hostname kernel: ffff880335123c88 ffffffffa0004d9f 000155f5d9538299 ffff880300000000
Feb 20 22:54:07 hostname kernel: 7ffffffffffffffd 000000011664300d ffff8803317b85f8 ffff880335123fd8
Feb 20 22:54:07 hostname kernel: Call Trace:
Feb 20 22:54:07 hostname kernel: [<ffffffffa0004d9f>] ? dm_table_unplug_all+0x5f/0x100 [dm_mod]
Feb 20 22:54:07 hostname kernel: [<ffffffff81127540>] ? sync_page+0x0/0x50
Feb 20 22:54:07 hostname kernel: [<ffffffff81127540>] ? sync_page+0x0/0x50
Feb 20 22:54:07 hostname kernel: [<ffffffff81539673>] io_schedule+0x73/0xc0
Feb 20 22:54:07 hostname kernel: [<ffffffff8112757d>] sync_page+0x3d/0x50
Feb 20 22:54:07 hostname kernel: [<ffffffff8153a13f>] __wait_on_bit+0x5f/0x90
Feb 20 22:54:07 hostname kernel: [<ffffffff811277b3>] wait_on_page_bit+0x73/0x80
Feb 20 22:54:07 hostname kernel: [<ffffffff810a14e0>] ? wake_bit_function+0x0/0x50
Feb 20 22:54:07 hostname kernel: [<ffffffff8113d8d5>] ? pagevec_lookup_tag+0x25/0x40
Feb 20 22:54:07 hostname kernel: [<ffffffff81127bdb>] wait_on_page_writeback_range+0xfb/0x190
Feb 20 22:54:07 hostname kernel: [<ffffffff8113c961>] ? do_writepages+0x21/0x40
Feb 20 22:54:07 hostname kernel: [<ffffffff81127d2b>] ? __filemap_fdatawrite_range+0x5b/0x60
Feb 20 22:54:07 hostname kernel: [<ffffffff81127da8>] filemap_write_and_wait_range+0x78/0x90
Feb 20 22:54:07 hostname kernel: [<ffffffff811c4aae>] vfs_fsync_range+0x7e/0x100
Feb 20 22:54:07 hostname kernel: [<ffffffff811c4b9d>] vfs_fsync+0x1d/0x20
Feb 20 22:54:07 hostname kernel: [<ffffffff811c4bde>] do_fsync+0x3e/0x60
Feb 20 22:54:07 hostname kernel: [<ffffffff811c4c13>] sys_fdatasync+0x13/0x20
Feb 20 22:54:07 hostname kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Aha! The trace shows sadc stuck in fdatasync waiting on page writeback, i.e.
blocked on disk I/O. And around 11 pm on February 20th was about the time
there were SAN issues. Of course! A reboot has fixed things and I'm guessing
they'll stay fixed. It wasn't a problem with the systems themselves, it was
the SAN. (Why didn't I figure this out right away, you may ask? Because I
didn't realize what I was looking at in /var/log/messages.)
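
For anyone who wants to check their own logs for the same thing, here is a
rough Python sketch that picks out the "blocked for more than N seconds"
lines. The function name and the regex are my own illustration, and it
assumes each syslog entry sits on a single line in the format shown in the
log excerpt in this message (timestamp, hostname, "kernel:").

```python
import re

# Matches the kernel hung-task report line, e.g.
# "Feb 20 22:54:07 hostname kernel: INFO: task sadc:39225 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (timestamp, task, pid, seconds), one per report."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append(
                (m.group("when"), m.group("task"),
                 int(m.group("pid")), int(m.group("secs")))
            )
    return hits
```

Passing it an open file works, e.g. find_hung_tasks(open("/var/log/messages")),
though reading that file usually needs root. A cluster of hits at the same
timestamp across several VMs would point at shared storage rather than any
one box.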

By the way, in my searching I found this:
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Very interesting. Might help your situation, Sean.
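
Before changing anything, it may be worth seeing what a box is currently set
to. A minimal sketch, assuming a Linux system with the usual writeback
tunables under /proc/sys/vm; the function name and the choice of which
tunables to list are mine, not taken from the article.

```python
from pathlib import Path

# Standard Linux writeback tunables found under /proc/sys/vm.
TUNABLES = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_tunables(base="/proc/sys/vm"):
    """Return {name: int} for each tunable that can be read under base."""
    values = {}
    for name in TUNABLES:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing or unreadable (for example, not a Linux system): skip it.
            continue
    return values
```

Printing read_vm_tunables() before and after a change makes it easy to
confirm that a sysctl setting actually took effect.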

--
Todd

On Wed, Feb 24, 2016 at 12:46 PM, Sean Flattery <sean.r.flattery at gmail.com>
wrote:

> I've had this happen when the VM couldn't talk to the SAN hosting the file
> system and would go read only due to IO wait. The high IO load was caused
> by other noisy neighbors so it was tough to track down. If things aren't
> virtualized, then I wholeheartedly endorse the advice to check your disks
> and make sure you have good backups.
>
> Thanks
> Sean Flattery
>
> Date: Wed, 24 Feb 2016 11:24:27 -0600
> From: Todd Walton <tdwalton at gmail.com>
> To: Central Iowa Linux Users Group <cialug at cialug.org>
> Subject: [Cialug] Spontaneous Outbreak of Read-Only-ness
> Message-ID: <CALm_Md9dSR1U8jkEWAKMi4qf+DtROLXg2XZzCozpp5pgm3WGiw at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> I've had two weird read-only-izings happen in the past 24 hours.
>
> First, a RHEL 6 box: Found I couldn't write files to /tmp, even as root.
> Further poking revealed that I couldn't write to / either. 'cat
> /proc/mounts' said / was rw. Changing SELinux to permissive didn't help.
> Rebooted. All is well.
>
> Then, a CentOS 6 box: Tried to update root password. It took the password
> twice and then said "passwd: Authentication token manipulation error". I
> tried to mv the shadow file, thinking maybe I'd re-shadow passwd, but it
> wouldn't let me move it because... read-only filesystem. Again, 'cat
> /proc/mounts' showed that that should not have been the case. I rebooted
> and all is now well.
>
> I can't troubleshoot these further right now because I made the problem go
> away. But anybody seen this before?
>
> --
> Todd