
RE: [cobalt-security] amd root?



Gary wrote:
> I've experienced the same behavior on a couple of my RaQ 3i's.
> The system completely locks up, only responding to pings, 
> but otherwise dead. I cannot replicate the behavior and I
> have no other reasons to suspect malicious behavior.

This is not _necessarily_ due to malicious or otherwise nefarious
activity. It's very simple: most RaQ owners/customers have either 64, 128 or
256MB of RAM installed (I know there are 512s out there, but not too many in
my experience).

Cobalt's "limit" is 250 virtual sites. This is, under most circumstances,
just fine - until you get a single very busy site. The daily log crunch to
generate the site statistics is (again, in my experience) a very common
cause of the "help! my RaQ locked up but I can still ping it!" syndrome.

And the reason? RAM famine.

The log crunching process(es) take up enough RAM as it is, but if you end up
with (say) a 120MB+ logfile - and that's just an example, YMMV - then the
scripts which process the logs end up consuming so much RAM that the kernel
is forced to do what's known as "aggressive swapping".
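
As a rough way to spot that kind of logfile before the daily crunch gets to
it, here is a minimal sketch in Python 3. The 120MB threshold is only the
example figure from above, and the directory you point it at is an assumption:
use wherever your sites actually keep their logs. The function name is made up
for illustration.

```python
import os
import sys


def find_large_logs(root, threshold_bytes=120 * 1024 * 1024):
    """Return (path, size) pairs for files under root at or above the threshold.

    The default threshold is just the 120MB example figure, not a hard limit.
    """
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= threshold_bytes:
                large.append((path, size))
    # Biggest files first, since those are the likely culprits.
    return sorted(large, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Directory to scan is taken from the command line; "." is a placeholder.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_large_logs(root):
        print(f"{size / (1024 * 1024):8.1f} MB  {path}")
```

Anything it lists is a candidate for rotating or splitting before the stats
run, though what counts as "too big" depends on how much RAM the box has.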

In that position, it swaps every single idle process out, but ends up in a
position where it simply cannot swap them back in again; so your webserver
can't serve web pages, inetd won't let you telnet in, sshd will listen but
won't be able to spin off an environment for you... and in extreme
circumstances the kswapd process ends up stuck in swap itself, in which case
you may see "do_try_to_free_pages failed" kernel errors in your messages log
after a reboot. If, that is, syslogd or klogd haven't been swapped out or
killed at that point ;-)
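
If you want to check for that error after a reboot, something along these
lines would do it. It's only a sketch in Python 3: the /var/log/messages path
is my assumption about where your messages log lives, and the function name is
made up for illustration.

```python
def find_swap_failures(lines, marker="do_try_to_free_pages failed"):
    """Return the log lines that contain the kernel error text."""
    return [line.rstrip("\n") for line in lines if marker in line]


if __name__ == "__main__":
    # Assumed log location; adjust if syslog writes elsewhere on your box.
    log_path = "/var/log/messages"
    with open(log_path, errors="replace") as fh:
        for hit in find_swap_failures(fh):
            print(hit)
```

No output doesn't prove the box wasn't starved of RAM, for the reason given
above: the loggers may not have been running by then.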

And there's the reason why the front panel doesn't work: there's no process
waiting for input any more, so the buttons become less than useful. The
machine still pings because the kernel, and hence the IP stack, is still
alive.
The only way out, as already noted, is a hard reset.

Note also that this syndrome can easily be caused by scripting, or in some
circumstances 'malicious' activity. I say 'malicious' because I'm not
talking about someone doing bad things to your server on purpose; they're
just trying to run something that shouldn't be there.

# man fork

should give you some idea, especially the second paragraph of the
description. It's trivial to cause this - it amounts to a local DoS.
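
One general way to blunt that kind of runaway forking is a per-user process
limit. This isn't something specific to the RaQ, just a minimal Python 3
sketch for Linux: the cap of 64 and the function names are assumptions for
illustration. Bear in mind that RLIMIT_NPROC counts all of the user's
processes, not just the children of this one command, and that it is generally
not enforced against root.

```python
import resource
import subprocess


def capped(limit, cap):
    """Return cap if limit is unlimited or larger than cap, else limit."""
    if limit == resource.RLIM_INFINITY or limit > cap:
        return cap
    return limit


def run_with_process_cap(argv, max_procs=64):
    """Run argv with RLIMIT_NPROC lowered to at most max_procs.

    max_procs=64 is an arbitrary illustrative figure, not a recommendation.
    """
    def apply_limit():
        # Runs in the child just before exec, so the parent is unaffected.
        soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        new_hard = capped(hard, max_procs)
        new_soft = capped(soft, new_hard)
        resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, new_hard))

    return subprocess.run(argv, preexec_fn=apply_limit)
```

Once the limit is hit, further fork() calls by that user fail with an error
instead of eating the machine.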

HTH

Graeme
-- 
Graeme Fowler
System Administrator
Host Europe Group PLC