Tag Archive for 'amazon'

The Linux OOM Killer

Most Linux distributions allow processes to request more memory than is available in the system. The logic behind the approval is that allocated memory is generally not used immediately, and that processes rarely use all of the memory they request over their lifetime. Over-committing thus lets the system fully utilize its memory, at the risk of out-of-memory (OOM) situations.

The purpose of the OOM Killer is to find the best process to kill in case of a severe memory shortage. The process is selected on the basis of a badness score, which is determined by the following properties:

  • original memory size of the process – the more memory a process uses, the higher its score
  • its CPU time
  • the run time – the longer a process has been alive, the lower its score
  • oom_adj value – /proc/<pid>/oom_adj can be set to a value between -17 and +15. The higher the value, the more likely the process is to be selected as the sacrificial lamb. Setting this value to -17 instructs the OOM Killer to never kill the process.
  • Half of each child’s memory size is added to parent’s score.
  • If the task has a nice value above zero, the score doubles
  • Superuser or direct hardware access tasks have their values divided by 4
  • Depending on oom_adj the value is adjusted as:
    • if oom_adj > 0, score <<= oom_adj
    • if oom_adj < 0, score >>= -(oom_adj)

The principle on which the OOM Killer operates is:

The system should lose the minimum amount of work done, recover a large amount of memory, not kill any process that is innocent of eating tons of memory, and kill the minimum number of processes (limited to 1 if possible).

The task with the highest badness score is selected and all its children are killed. If the process does not have any children, the process itself is killed.

Adding more info from OOM_KILLER.

The function which does the above mentioned badness score computation is called badness(). It gets called by the following chain:

__alloc_pages() -> out_of_memory() -> select_bad_process() -> badness()

The badness() function accumulates points for each process and returns them to select_bad_process(). The scoring of a process starts with the size of its virtual memory (total_vm):

        /*
         * The memory size of the process is the basis for the badness.
         */
        points = p->mm->total_vm;

The memory size of any child is added to the process:

        /*
         * Processes which fork a lot of child processes are likely
         * a good choice. We add the vmsize of the childs if they
         * have an own mm. This prevents forking servers to flood the
         * machine with an endless amount of childs
         */
          ...
                  if (chld->mm != p->mm && chld->mm)
                        points += chld->mm->total_vm;

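Processes that have used a lot of CPU time or have been running for a long time have their score decreased (the score is divided by the square root of the CPU time and by the fourth root of the run time), and processes with a nice value above zero have their score doubled: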

        s = int_sqrt(cpu_time);
        if (s)
                points /= s;
        s = int_sqrt(int_sqrt(run_time));
        if (s)
                points /= s;

        /*
         * Niced processes are most likely less important, so double
         * their badness points.
         */
        if (task_nice(p) > 0)
                points *= 2;

Superuser processes and direct hardware access tasks have their scores reduced:

        /*
         * Superuser processes are usually more important, so we make it
         * less likely that we kill those.
         */
        if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
                                p->uid == 0 || p->euid == 0)
                points /= 4;

        /*
         * We don't want to kill a process with direct hardware access.
         * Not only could that mess up the hardware, but usually users
         * tend to only have this flag set on applications they think
         * of as important.
         */
        if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
                points /= 4;


Finally, honour the oom_adj setting:

        /*
         * Adjust the score by oomkilladj.
         */
        if (p->oomkilladj) {
                if (p->oomkilladj > 0)
                        points <<= p->oomkilladj;
                else
                        points >>= -(p->oomkilladj);
        }

Thus the ideal candidate will be:

One that was recently started, is a non-privileged process which together with its children uses a lot of memory, has been nice’d and does no I/O. Something like a nohup’d parallel kernel build (which is not a bad choice since all results are saved to disk and very little work is lost when a make is terminated).
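
To see how the individual adjustments combine, here is a minimal sketch in Java that mirrors the excerpts above step by step. It is not kernel code: the class name BadnessSketch, the method names, and the parameters are hypothetical, memory and time values are treated as abstract units, and returning 0 for an oom_adj of -17 is only a stand-in for "never selected".

```java
public class BadnessSketch {
    // Integer square root, standing in for the kernel's int_sqrt().
    static long intSqrt(long x) {
        return (long) Math.floor(Math.sqrt((double) x));
    }

    // Hypothetical re-creation of the scoring steps shown in the excerpts.
    static long badness(long totalVm, long childVm, long cpuTime, long runTime,
                        int nice, boolean superuser, boolean rawIo, int oomAdj) {
        if (oomAdj == -17) {
            return 0; // assumption: models "never kill" as a zero score
        }
        // Start from the process memory size plus that of its children.
        long points = totalVm + childVm;

        // Divide by sqrt(cpu_time) and by the fourth root of run_time.
        long s = intSqrt(cpuTime);
        if (s != 0) points /= s;
        s = intSqrt(intSqrt(runTime));
        if (s != 0) points /= s;

        // Niced processes are doubled; privileged ones are divided by 4.
        if (nice > 0) points *= 2;
        if (superuser) points /= 4;
        if (rawIo) points /= 4;

        // Finally, shift left or right according to oom_adj.
        if (oomAdj > 0) points <<= oomAdj;
        else if (oomAdj < 0) points >>= -oomAdj;
        return points;
    }

    public static void main(String[] args) {
        // 100000 + 20000 units of memory, cpu_time 100, run_time 10000, nice 5.
        System.out.println(badness(100000, 20000, 100, 10000, 5, false, false, 0));
        // Same process with oom_adj = 2.
        System.out.println(badness(100000, 20000, 100, 10000, 5, false, false, 2));
    }
}
```

With these made-up inputs the base score is 120000 / 10 / 10 * 2 = 2400, and an oom_adj of 2 shifts it to 9600, which shows how strongly oom_adj dominates the other factors.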

From the SDE Tip – Amazon


Your ClassNotFound Error Is Probably Not Telling You Everything

Quicktip
If a class fails to load due to an exception during class initialization, the actual problem is only logged the first time you attempt to load the class. After that, the classloader recognizes that it has already tried to load the class and just throws a ClassNotFoundException or NoClassDefFoundError.

Symptom
You will see logs for the ClassNotFoundException or NoClassDefFoundError (usually many such logs), even though the class is in the classpath. You won’t see a root cause on any stack trace but the first (and that one is typically not a CNFE or NCDFE).

Finding the root cause
If you get a CNFE or NCDFE and you see the class in the classpath, search back through your logs for ${missing classname}.<clinit> in a stack trace to figure out what prevented the class from being loaded. Remember that the log may have rolled off.

Possible root causes for CNFE/NCDFE

  • Desired class is not in classpath (this is the boring case)
  • Initialization of desired class throws a RuntimeException or an Error
    • Static variable is initialized via a function which threw an uncaught Throwable
      • E.g. public static final String SOME_CONST = SomeClass.getString("SOME_KEY"); where SomeClass.getString(String key) can throw an Exception.
    • Static block (code in static {} in the class definition) threw an uncaught Throwable
      • E.g. public class Foo { static { doSomeStaticInitialization(); } …
    • Variable or method signature includes a type which could not be initialized (see this same list of root causes)
      • E.g. public class Foo { SomeClassWithErrorInInitializer attr1; …
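
The behaviour is easy to reproduce. The following is a minimal sketch (ClinitDemo, Broken, and touch are hypothetical names chosen for illustration) in which a static field initializer always throws. The first access reports the real failure wrapped in an ExceptionInInitializerError; every later access reports only a NoClassDefFoundError.

```java
public class ClinitDemo {
    // Hypothetical class whose static initialization always fails.
    static class Broken {
        static final String VALUE = load();

        static String load() {
            throw new IllegalStateException("config missing");
        }
    }

    // Touches Broken and returns the simple name of whatever Throwable surfaces.
    static String touch() {
        try {
            return Broken.VALUE;
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First attempt: the original exception is wrapped and reported.
        System.out.println(touch());
        // Later attempts: the JVM only says the class could not be initialized.
        System.out.println(touch());
    }
}
```

This is why the stack trace containing <clinit> appears only once in the logs: the IllegalStateException is attached as the cause of the first error only.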

From The SDE Tip – Amazon


What’s in /proc?

Quicktip:

/proc has a subdirectory for each running process as well as some subdirectories for various aspects of your hardware. You can learn lots of details about your CPU, your memory, processes running on your computer, and more by digging around in /proc.

A few things you’ll find in /proc:

  • /proc/cpuinfo: Info about your cpu
  • /proc/meminfo: Info about your RAM
  • In /proc/<pid>/:
    • cmdline – the command line used to start the process
    • cwd – a link to the current working directory of the process
    • environ – the environment variables for the process, separated by nulls. Try "cat /proc/<pid>/environ | xargs -0 -n 1"
    • fd – directory containing links to file descriptors opened by the process, including files that have been deleted since the fd was created!

What it’s good for:

  • Recover a deleted file if some process is still writing to it (tail -n 999999999 -f /proc/<pid>/fd/<file handle> > ~/recovered-file; kill <pid>)
  • Check if your cpu is 32 or 64 bit
  • Look into the environment variables your process was started with.
  • See what directory a process is running in.
  • Quickly see the full command line of a process.
  • Much more – look around!
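
Because cmdline and environ are NUL-separated, reading them from a program means splitting on the NUL byte. Here is a minimal sketch, assuming a Linux system with /proc mounted; the class name ProcPeek and the helper splitNul are hypothetical names, and /proc/self is used so the program inspects its own process.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ProcPeek {
    // Splits a NUL-separated /proc record (cmdline, environ) into its entries.
    static List<String> splitNul(byte[] raw) {
        List<String> entries = new ArrayList<>();
        for (String s : new String(raw, StandardCharsets.UTF_8).split("\0")) {
            if (!s.isEmpty()) {
                entries.add(s);
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path proc = Paths.get("/proc/self");
        if (!Files.isDirectory(proc)) {
            System.out.println("/proc is not available on this system");
            return;
        }
        // cmdline: the arguments the process was started with.
        System.out.println("cmdline: " + splitNul(Files.readAllBytes(proc.resolve("cmdline"))));
        // cwd: a symlink to the current working directory.
        System.out.println("cwd:     " + Files.readSymbolicLink(proc.resolve("cwd")));
        // environ: one NAME=value entry per environment variable.
        for (String entry : splitNul(Files.readAllBytes(proc.resolve("environ")))) {
            System.out.println("env:     " + entry);
        }
    }
}
```

Replacing "self" with a numeric pid inspects another process, subject to the usual file permissions.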

 

From The SDE Tip – Amazon


Hive UDF: Cannot Run Program – No Such File Or Directory

We are using Amazon EMR to run Hive. I wrote a Perl script to carry out certain transformations. The script is stored in S3 and has executable permission for all users. However, when I use the script I get an error saying the program could not be run because no such file or directory was found!

I confirmed that Hadoop did download the file and that all permissions were set.

Baffled, I googled around to see if people have had this issue. And I found a match in one of the AWS Developer Forums - https://forums.aws.amazon.com/message.jspa?messageID=126905.

To quote:

Hadoop fetches your file from S3 and puts it in the distributed cache before starting the job. During this process Hadoop flips the executable bit of the file off and the file is no longer an executable in the distributed cache. The error message is a bit misleading, but you should be able to get it to work if you explicitly invoked PHP.

Taking the hint, I modified my HiveQL to invoke the Perl interpreter explicitly on the script.

And things now work like a charm.


Impossible Null Pointer Exceptions

I must admit this had never occurred to me before but after reading, the explanation seems so obvious!

You can get a NullPointerException from lines in Java which appear to have no possibility of throwing one, such as:

this.setCount(num)

The NullPointerException can come about when num is of type Integer and setCount takes an int parameter. Java’s auto-unboxing will automatically call num.intValue(), and if num is null you get an exception.

Of course, the fix is to check for null-ness and treat it however the semantics of your operation requires.
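
A minimal sketch of the situation described above, with a hypothetical class UnboxingDemo and a helper throwsNpe added only to make the behaviour observable:

```java
public class UnboxingDemo {
    private int count;

    // Takes a primitive int, so an Integer argument must be unboxed first.
    void setCount(int count) {
        this.count = count;
    }

    // Returns true if passing num to setCount triggers a NullPointerException.
    static boolean throwsNpe(Integer num) {
        try {
            // The compiler inserts num.intValue() here.
            new UnboxingDemo().setCount(num);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(5));    // a real value unboxes fine
        System.out.println(throwsNpe(null)); // null cannot be unboxed
    }
}
```

The call site looks harmless because the dereference is generated by the compiler rather than written in the source.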

From the SDE Tip – Amazon


Find Files Of A Given Name Instantly

Usually when I have to find a file, I use a find and a grep command together – because I still haven’t learnt to use find properly. This is what I do:

find -L . | grep <name>

But this process is usually slow, as it recursively traverses the whole directory structure, following symlinks, to list every file and only then greps for the name that I need.

Recently I learnt about a faster way of doing the same – use locate to find files of a given name quickly. The command is:

locate <name>

It relies on an index of all the files on the system and searches that index when called. The index is rebuilt (by default) once a day by a cron’d find, so results may be up to 24 hours stale.

If locate is not already present on your system, you could install it by:

sudo yum install mlocate

From the SDE Tip – Amazon


Save File With Sudo Permissions In VIM

A lot of times I open a file, make changes, and only when the time comes to save it realize that I hadn’t sudo’ed it. I learnt that the following command helps:

:w ! sudo tee %

Here’s the reason:

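:w ! tells vi to write the buffer contents to the standard input of the command that follows, instead of to the file

sudo tee executes the tee command with sudo permissions, taking its input from :w and writing it to the file named as its argument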

% here means the current file name

From the SDE Tip – Amazon
