Repeating Outages


#1

I’m getting desperate.

I have a BigV server with 2GB of memory that serves a few WordPress and Joomla websites, plus a little-used Ruby on Rails app.

Several times a day, the whole thing locks up with a high system load and would ultimately die if it were not for a ‘monit’ job that spots the high load and starts killing things until it recovers. Outages last between 2 and 20 mins, depending on how long it takes for the system load to drop.

The server sits on a LUKS whole-disk encryption scheme, with LVM on top of that.

I’m guessing that this is too much overhead, given that there’s already some kind of volume management carried out by Bytemark to provision the ‘disks’ in the first place.

Does anyone else do this and do you have issues?

I’m contemplating removing the encryption layer, but if there’s a simple tweak somewhere to be done, I’d prefer not to.

If it’s a choice between encryption and LVM, I’d prefer to retain LVM for snapshots and rolling back ‘stuff’.

I was an early adopter of BigV, and after its initial teething trouble, this setup settled down and ran fine for months.

It is now particularly bad, and I need to do something about it.

Ideas?


#2

Have you tried asking Bytemark if there are any known problems with your server’s tail around the times it’s suffering high system load? SSHing in and leaving dstat -tcdnmgy running in a local terminal gives you a record of its descent when it next crashes, and may give further clues. Do the sites’ access logs show anything unusual in the lead-up to a crash?
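
If the SSH session dies along with the box, you lose the tail end of that record, so it’s worth detaching it and logging to a file as well. A rough sketch, assuming tmux and dstat are installed (the session name and log path are just examples):

    # keep dstat running detached so its output survives the SSH session dying
    tmux new-session -d -s dstat-watch \
        "dstat -tcdnmgy --output /var/log/dstat-watch.csv"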


#3

I’ve had several tickets on this over the last few months. Nothing seems to correlate.

There’s nothing obvious in the logs. The system load seems to sky-rocket for no real reason. Obviously there is a reason, but it cripples things so quickly that by the time I get a high-load alert from ‘monit’ I usually can’t log in fast enough to spot it.

I’m presuming that as I can’t see anything in the server logs, it must be something in the underlying infrastructure. Or at least a higher than usual load on the server that pushes the infrastructure over the edge.

I run this kind of LVM on top of encryption at home, and it’s no problem at all. But I am much closer to the bare metal, of course.

I will try the dstat thing on the serial console.

I get occasional website ‘attacks’, with bots repeatedly trying to log in to the CMSs. This does impact things a little. I deal with those via the firewall and ‘ipset’, which works quite well if the attack is prolonged.

I could do with finding something to lock down the source IPs automatically, really. That would probably improve my life somewhat.
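
For reference, the manual ipset routine amounts to something like this (a sketch only; the set name and timeout are illustrative, not my actual rules):

    # one set, referenced from a single DROP rule; entries age out after two hours
    ipset create badbots hash:ip timeout 7200
    iptables -I INPUT -m set --match-set badbots src -j DROP
    # blocking an offender is then a single command
    ipset add badbots 192.0.2.10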

Thanks for your help.


#4

I would guess that one or more of your sites has been compromised. Worth checking the logs, mailq, maldet (if you have it installed) etc. to see what might be going on. Some of the current things that compromise CMS packages in particular run for a few minutes every hour or two, usually sending out a flood of emails in that period. I’ve had similar myself a couple of times and it took quite a bit of effort tracking down exactly which files had been hit.
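
A couple of quick checks along those lines, as a sketch (the paths assume a fairly standard Debian-style setup):

    # is anything queuing a flood of outbound mail right now?
    mailq | tail -n 1
    # any web files changed recently that you didn't change yourself?
    find /var/www -type f -name '*.php' -mtime -2 -ls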


#5

Other than a few sites being subject to repeated login attempts, I think I’m OK.

I try to keep the CMSs up to date. I’ll look at mail traffic as a pointer. I get a whole load of incoming spam, which causes ClamAV to get busy sometimes. Ordinarily, there’s not much outgoing mail. I have a few alias redirects and half a dozen low-volume mailing lists.

In a way, I’d prefer a compromise to an infrastructure problem! I love BigV, and I like it the way I’ve configured it. I can probably deal with an infestation, but I can’t afford a dedicated box.

Thanks for your suggestions.


#6

I find it interesting that it ran so well for months before this started. I will be very interested to find out what the issue is when you finally get it sorted; I am a bit odd like that, lol.


#7

Just finished a clamscan of the file system. Nothing other than quarantined banned emails (now deleted) and some PUAs that seem legit.

I think it’s always been delicate. With 2GB of memory, I think it’s well spec’d for what it does, too. That’s why I’ve been suspecting the NAS behind the scenes.

I don’t really want to run too many intrusive diagnostics, to avoid pissing people off. It might be worth running some kind of file system benchmark, though.
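
Even a crude sequential test would show whether the disk stack itself is struggling. A sketch (the path and size are arbitrary; oflag=direct keeps the page cache out of the measurement):

    # crude sequential write test, bypassing the page cache
    dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=512 oflag=direct conv=fsync
    rm -f /var/tmp/ddtest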


#8

Might be worth setting up a second BigV instance over a couple of weeks. Move the Joomla sites to it one by one (really quick with a tool like Akeeba, as it does it all in one) and see if you can reduce the load. At least that way your sites will have fewer outages and your customers will stay happy. I prefer to keep my CMS packages on different instances, as it allows better optimisation and less exposure to specific hacks. Having said that, I have one site that uses three different CMS packages for its various elements, which is a right royal pain. Yes, it does up the cost slightly, but not by much if you balance the resources carefully.


#9

Are you aware monit can run commands in addition to sending the alert? See exec at https://mmonit.com/monit/documentation/monit.html#ACTION. You could run ps xauww to snapshot CPU load and memory usage at that point. I suspect clamav. There’s another thread on these forums complaining about its occasional memory-hog behaviour, and I have access to several Bytemark VMs where it’s configured. The box will be running fine, disk data cached in RAM, then clamav does one of its periodic activities, using lots of memory, and even if there’s CPU to spare the aftermath means a lot of cache refilling.
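
As a sketch of that idea (the threshold, script name and log path are illustrative, not anyone’s actual config):

    # in monitrc: run a script whenever the load spikes
    check system myserver
        if loadavg (1min) > 8 then exec "/usr/local/bin/ps-snapshot"

and the script itself only needs to be something like:

    #!/bin/sh
    # append a timestamped process listing to pick over after the event
    { date; ps xauww; } >> /var/log/ps-snapshot.log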

AIUI, clamav holds its mass of patterns in memory for some activities; they are ever growing, as there’s more to match over time, and its authors don’t think it should read them from disk instead because its work pattern is essentially random access. It’s becoming unfit for purpose in some cases, and I was wondering about having a dedicated mail/clamav VM for all the different, otherwise unrelated, domains to stop it impacting on web/DB.
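
If you want to see how big clamd actually is on a given box (assuming the daemon runs under the name clamd):

    # resident (RSS) and virtual size of the clamd process, in kB
    ps -C clamd -o rss,vsz,cmd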


#10

Curiously good timing.

I have decided that enough is enough. I’m pretty sure that it’s the combination of too many layers of disk abstraction with clamav and friends on top.

I have just provisioned a new BigV server without full disk encryption (which I believe to be the underlying problem). I am currently migrating my domains across.

I am considering using my first machine as a stand-alone amavisd-new server.

The word on the street is that hosting providers shouldn’t do email filtering on behalf of clients as it is a ‘personal’ thing, best left to clients. However, if I were to unleash the torrent of spam onto my clients I would very quickly get into trouble.

Maybe there’s scope for a community BigV amavisd server that we could share on an equitable basis. I don’t know how you would practically do the accounting though.


#11

I found that there were a few domains known for brute-force attacks, and they were scanning the web server, email (postfix/sasl/dovecot) and FTP. I ramped up the fail2ban ban periods to 2 hours instead of a few minutes and reduced the failure limit to 3 attempts in 10 minutes (rough jail settings at the end of this post). The load average on the server dropped from over 3 to about 1. I then ran

    awk '($(NF-1) == "Ban"){print $NF}' /var/log/fail2ban.log | sort | uniq -c | sort -n

to get a list of repeat offenders, which I then checked at abuseipdb.com. I put the worst offenders (mostly in China / Russia / Ukraine) into my firewall’s permanent rules:

    iptables -I INPUT -s 221.194.44.0/24 -j DROP
    iptables -I INPUT -s 121.18.238.0/24 -j DROP
    iptables -I INPUT -s 116.10.116.0/24 -j DROP
    iptables -I INPUT -s 116.31.116.0/24 -j DROP
    iptables -I INPUT -s 120.25.175.0/24 -j DROP
    iptables -I INPUT -s 81.0.91.0/24 -j DROP
    iptables -I INPUT -s 49.82.14.0/24 -j DROP

The load average on the server dropped below 1 and now runs at about 0.7.
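
The jail settings mentioned above boil down to something like this in jail.local (a sketch; the values mirror what I described, but your file layout may differ):

    # /etc/fail2ban/jail.local -- ban for 2 hours after 3 failures within 10 minutes
    [DEFAULT]
    bantime  = 7200
    findtime = 600
    maxretry = 3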


#12

But wouldn’t disk encryption’s overhead be fairly constant, as it’s a block at a time? I still think having dstat -tcdnmgy chugging away in the background allows retrospective investigation into the nature, and time, of problems, e.g. was it high CPU load, or high wait (for I/O)? Or perhaps you can see free memory shoot up when something (clamav) exits and then gradually fall again as the cache is reloaded.


#13

Here we go. I was running dstat and just received an email from monit saying the server was unresponsive. I had a look at the dstat job: it had just locked up (no longer updating) with the following last page of info. It seemed to be going well and then, bang, dead. There’s a swap-out of 82MB on the last line, but I have no idea whether that’s a lot or not. CPU wait went very high, very quickly too.

21-02 11:01:09|  0   0   0 100   0   0|   0     0 | 132B  436B|   0     0 >
21-02 11:01:10|  0   0   0 100   0   0|   0    36k| 192B  436B|   0     0 >
21-02 11:01:11|  0   0   0 100   0   0|   0   120k| 192B  782B|   0     0 >
21-02 11:01:12|  0   0  87  13   0   0|   0   212k| 312B  201B|   0     0 >
21-02 11:01:13|  0   0  94   6   0   0|   0    60k| 473B  452B|   0     0 >
21-02 11:01:14|  0   0 100   0   0   0|   0     0 | 252B  798B|   0     0 >
21-02 11:01:15|  0   0 100   0   0   0|   0     0 | 676B  201B|   0     0 >
21-02 11:01:16|  1   0  99   0   0   0|   0     0 | 192B  436B|   0     0 >
21-02 11:01:17|  0   0 100   0   0   0|   0     0 | 252B  330B|   0     0 >
21-02 11:01:18|  0   0 100   0   0   0|   0     0 | 413B  436B|   0     0 >
21-02 11:01:19|  0   0 100   0   0   0|   0     0 | 287B  436B|   0     0 >
21-02 11:01:20|  0   0  98   2   0   0|   0    32k| 192B  872B|   0     0 >
21-02 11:01:21|  0   0 100   0   0   0|   0     0 | 312B  547B|   0     0 >
21-02 11:01:22|  1   0  99   0   0   0|   0     0 | 252B  201B|   0     0 >
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
     time     |usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
21-02 11:01:23|  0   0 100   0   0   0|   0     0 | 831B  766B|   0     0 >
21-02 11:01:24|  0   0  99   1   0   0|4096B    0 |1414B 2152B|   0     0 >
21-02 11:01:25|  0   0 100   0   0   0|   0     0 | 730B  201B|   0     0 >
21-02 11:01:26|  0   1  99   0   0   0|   0     0 | 730B  346B|   0     0 >
21-02 11:01:27|  0   0 100   0   0   0|   0     0 | 192B  542B|   0     0 >
21-02 11:01:28|  0   0  98   2   0   0|   0    56k| 438B  436B|   0     0 >
21-02 11:01:29|  6  15  51  28   0   0|2124k  228k| 312B  436B|   0   988k>
21-02 11:01:30|  1   9   0  90   0   0| 112k   21M|2698B  452B|  92k   82M>

#14

For those playing along at home, here’s that same data reformatted. (Use three backticks in a row on a line by themselves to start and stop.)

    ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
    time          |usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
    21-02 11:01:09|  0   0   0 100   0   0|    0    0 | 132B  436B|   0     0 >
    21-02 11:01:10|  0   0   0 100   0   0|    0   36k| 192B  436B|   0     0 >
    21-02 11:01:11|  0   0   0 100   0   0|    0  120k| 192B  782B|   0     0 >
    21-02 11:01:12|  0   0  87  13   0   0|    0  212k| 312B  201B|   0     0 >
    21-02 11:01:13|  0   0  94   6   0   0|    0   60k| 473B  452B|   0     0 >
    21-02 11:01:14|  0   0 100   0   0   0|    0    0 | 252B  798B|   0     0 >
    21-02 11:01:15|  0   0 100   0   0   0|    0    0 | 676B  201B|   0     0 >
    21-02 11:01:16|  1   0  99   0   0   0|    0    0 | 192B  436B|   0     0 >
    21-02 11:01:17|  0   0 100   0   0   0|    0    0 | 252B  330B|   0     0 >
    21-02 11:01:18|  0   0 100   0   0   0|    0    0 | 413B  436B|   0     0 >
    21-02 11:01:19|  0   0 100   0   0   0|    0    0 | 287B  436B|   0     0 >
    21-02 11:01:20|  0   0  98   2   0   0|    0   32k| 192B  872B|   0     0 >
    21-02 11:01:21|  0   0 100   0   0   0|    0    0 | 312B  547B|   0     0 >
    21-02 11:01:22|  1   0  99   0   0   0|    0    0 | 252B  201B|   0     0 >
    21-02 11:01:23|  0   0 100   0   0   0|    0    0 | 831B  766B|   0     0 >
    21-02 11:01:24|  0   0  99   1   0   0|4096B    0 |1414B 2152B|   0     0 >
    21-02 11:01:25|  0   0 100   0   0   0|    0    0 | 730B  201B|   0     0 >
    21-02 11:01:26|  0   1  99   0   0   0|    0    0 | 730B  346B|   0     0 >
    21-02 11:01:27|  0   0 100   0   0   0|    0    0 | 192B  542B|   0     0 >
    21-02 11:01:28|  0   0  98   2   0   0|    0   56k| 438B  436B|   0     0 >
    21-02 11:01:29|  6  15  51  28   0   0|2124k  228k| 312B  436B|   0   988k>
    21-02 11:01:30|  1   9   0  90   0   0| 112k   21M|2698B  452B| 92k    82M>

It starts with 100% wait for several seconds, and we don’t know what it was doing before that. This means no CPU work could be done because everything was waiting for I/O. No paging is happening then or after, so we can say it’s not that, but the further interesting columns are missing, thus the >, so we can’t tell how memory allocation was changing as a result of I/O.

Then it’s quiet for about 12 seconds before a bit of CPU activity that needs the kernel, 15%, and that in turn probably causes the 28%, and then 90%, wait. It started to read 2124k from disk, paging out 988k, and then pages out even more, 82M, and yes, that’s quite a bit for a second in my BigV experience. Is that paging space on SSD or spinning-rust Archive disks?

This doesn’t look like disk encryption overhead. When I’ve seen a BigV VM stall like this, e.g. runaway memory query in MySQL, then it can recover eventually as the I/O completes, assuming the kernel’s OOM Killer didn’t nobble something vital. But it’s often quicker to forcibly reboot. When it comes back up, check log files for around this time; I still blame ClamAV. :slight_smile:


#15

It’s SSD all the way (apart from the backups).

I assume that there are many layers of the disk sub-system involved, though.

  1. Raw disks
  2. RAID of some kind.
  3. Volume manager (LVM?)
  4. Volume allocation to VMs (iSCSI?)

then I do:

  1. Full volume encryption
  2. Volume manager (LVM)
  3. Filesystem (currently ext3)

with the applications/OS on top.

That’s a lot. And it’s all shared (in so many ways).

The system recovers itself quite well. Monit starts killing apache and mysql (and restarts them later). That seems to keep things tottering along without a reboot.


#16

There are the SSDs and RAID on the tails. Then it’s the network, using Network Block Device IIRC, and that appears as a block device to your VM. Your VM’s CPU usage shows the layers you’ve added, e.g. ext3. I still think the CPU load of those will be near constant and doesn’t explain your symptoms, which are high waits, i.e. waits on the block device interface to the kernel = NBD across the network. Today I bumped a non-Bytemark VM from 0.75 GiB to 1.5 GiB because clamd had got bigger since it was originally sized. Those missing dstat columns would show how memory is bucketed, e.g. used v. cached, and how they change in tandem with the waiting disk accesses.


#17

Well, it’s been a long time. The new BigV server was nice and new and clean, and went well.

Then the alerts started again, system load going up into double digits for a time. Horrible.

It was looking like it might not have been the full disk encryption in itself that caused the problem. It was just a stab in the dark really.

I decided that there must have been something in all the complaints about ClamAV after all.

I didn’t want to stop filtering for viruses, but what can you do, short of shipping Bytemark a barrowful of cash every month for more memory to deal with the load?

I decided (perhaps radically) that the only thing that annoys users more than a virus is lots of spam. Counter-intuitive perhaps, but most people can deal with a few viruses; it’s a heavy load of spam that is a pain.

So I kept the spam filtering, and dropped the antivirus checking, and then added the following to stop spam from entering the system in the first place:

    smtpd_client_restrictions = permit_mynetworks, permit_sasl_authenticated, reject_unauth_destination, reject_rbl_client zen.spamhaus.org, reject_rbl_client bl.spamcop.net, reject_rbl_client cbl.abuseat.org, permit

Spam has dropped to a negligible amount. System load is next to nothing. No more ‘bongs’ in the night.

I’m a lot more relaxed. I’d like to do proper AV filtering, but it doesn’t seem necessary.


#18

Tony,
To which file have you added:

    smtpd_client_restrictions = permit_mynetworks,
      permit_sasl_authenticated,
      reject_unauth_destination,
      reject_rbl_client zen.spamhaus.org,
      reject_rbl_client bl.spamcop.net,
      reject_rbl_client cbl.abuseat.org,
      permit

Regards Pete


#19

Postfix config file: /etc/postfix/main.cf
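
After editing it, standard Postfix tooling lets you confirm and apply the change:

    postconf smtpd_client_restrictions   # show the value as read from main.cf
    postfix reload                       # make the running daemons pick it up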


#20

Hi Tony

As I understand it, cbl.abuseat.org is included in Spamhaus XBL, which in turn is part of ZEN, so you could save a lookup.

(Spamcop and dnsbl.sorbs.net are giving false positives here for the likes of facebookmail, which may or may not matter.)
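
In other words, the posted list could shrink to something like this (a sketch based on Tony’s config above, minus the redundant CBL lookup; drop the spamcop line too if the false positives matter):

    smtpd_client_restrictions = permit_mynetworks,
      permit_sasl_authenticated,
      reject_unauth_destination,
      reject_rbl_client zen.spamhaus.org,
      reject_rbl_client bl.spamcop.net,
      permit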

I’ve also dropped clamd, trying to debug what I still believe to have been an ActiveSync horror show, but I’m fairly confident a 2GB Symbiosis machine can handle it for several more months.