Discussion:
memory fragmentation with malloc
G. Maly
2007-07-30 14:34:05 UTC
Consider an application that does many memory allocations via malloc/new
(brk/sbrk). Each allocation is smaller than the page size. The
application frees most of the allocated blocks, but some of the later-allocated
blocks are still in use, so memory consumption remains high -
a typical case of memory fragmentation.

A possible solution could be to use a malloc-replacement library. Do any
of these libraries internally call, for example, mmap instead of brk, even
for small blocks?

Does anybody have positive experience with such a library that:
- runs on Linux/x86-64
- helps to avoid fragmentation
- has good performance in multi-threaded environments
- ideally doesn't need any changes in the application's code (dynamically
loaded via LD_PRELOAD)
- is stable ;)

Thanks,
Gennady
Spoon
2007-07-30 15:06:26 UTC
Post by G. Maly
Consider an application that does many memory allocations via malloc/new
(brk/sbrk). Each allocation is smaller than the page size. The
application frees most of the allocated blocks, but some of the later-allocated
blocks are still in use, so memory consumption remains high -
a typical case of memory fragmentation.
A possible solution could be to use a malloc-replacement library. Do any
of these libraries internally call, for example, mmap instead of brk, even
for small blocks?
- runs on Linux/x86-64
- helps to avoid fragmentation
- has good performance in multi-threaded environments
- ideally doesn't need any changes in the application's code (dynamically
loaded via LD_PRELOAD)
- is stable ;)
AFAIU, one can tweak / tune glibc malloc.

cf. /usr/include/malloc.h

/* mallopt options that actually do something */
#define M_TRIM_THRESHOLD -1
#define M_TOP_PAD -2
#define M_MMAP_THRESHOLD -3
#define M_MMAP_MAX -4
#define M_CHECK_ACTION -5
#define M_PERTURB -6

also cf. malloc.c
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/malloc/malloc.c?rev=1.180&cvsroot=glibc

mallopt(int parameter_number, int parameter_value)
Sets tunable parameters The format is to provide a
(parameter-number, parameter-value) pair. mallopt then sets the
corresponding parameter to the argument value if it can (i.e., so
long as the value is meaningful), and returns 1 if successful else
0. SVID/XPG/ANSI defines four standard param numbers for mallopt,
normally defined in malloc.h. Only one of these (M_MXFAST) is used
in this malloc. The others (M_NLBLKS, M_GRAIN, M_KEEP) don't apply,
so setting them has no effect. But this malloc also supports four
other options in mallopt. See below for details. Briefly, supported
parameters are as follows (listed defaults are for "typical"
configurations).

Symbol param # default allowed param values
M_MXFAST 1 64 0-80 (0 disables fastbins)
M_TRIM_THRESHOLD -1 128*1024 any (-1U disables trimming)
M_TOP_PAD -2 0 any
M_MMAP_THRESHOLD -3 128*1024 any (or 0 if no MMAP support)
M_MMAP_MAX -4 65536 any (0 disables use of mmap)
G. Maly
2007-07-30 15:27:47 UTC
Post by Spoon
Post by G. Maly
Consider an application that does many memory allocations via malloc/new
(brk/sbrk). Each allocation is smaller than the page size. The
application frees most of the allocated blocks, but some of the
later-allocated blocks are still in use, so memory consumption
remains high - a typical case of memory fragmentation.
A possible solution could be to use a malloc-replacement library. Do
any of these libraries internally call, for example, mmap instead of brk,
even for small blocks?
- runs on Linux/x86-64
- helps to avoid fragmentation
- has good performance in multi-threaded environments
- ideally doesn't need any changes in the application's code (dynamically
loaded via LD_PRELOAD)
- is stable ;)
AFAIU, one can tweak / tune glibc malloc.
cf. /usr/include/malloc.h
/* mallopt options that actually do something */
#define M_TRIM_THRESHOLD -1
#define M_TOP_PAD -2
#define M_MMAP_THRESHOLD -3
#define M_MMAP_MAX -4
#define M_CHECK_ACTION -5
#define M_PERTURB -6
also cf. malloc.c
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/malloc/malloc.c?rev=1.180&cvsroot=glibc
mallopt(int parameter_number, int parameter_value)
Sets tunable parameters The format is to provide a
(parameter-number, parameter-value) pair. mallopt then sets the
corresponding parameter to the argument value if it can (i.e., so
long as the value is meaningful), and returns 1 if successful else
0. SVID/XPG/ANSI defines four standard param numbers for mallopt,
normally defined in malloc.h. Only one of these (M_MXFAST) is used
in this malloc. The others (M_NLBLKS, M_GRAIN, M_KEEP) don't apply,
so setting them has no effect. But this malloc also supports four
other options in mallopt. See below for details. Briefly, supported
parameters are as follows (listed defaults are for "typical"
configurations).
Symbol param # default allowed param values
M_MXFAST 1 64 0-80 (0 disables fastbins)
M_TRIM_THRESHOLD -1 128*1024 any (-1U disables trimming)
M_TOP_PAD -2 0 any
M_MMAP_THRESHOLD -3 128*1024 any (or 0 if no MMAP support)
M_MMAP_MAX -4 65536 any (0 disables use of mmap)
Yes, malloc is tunable. But the parameter M_MMAP_THRESHOLD (the internal
brk/mmap allocation switch) only works for allocated block sizes larger
than the page size.
It doesn't help if the application allocates many small blocks.
Paul Pluzhnikov
2007-07-30 17:02:40 UTC
Post by G. Maly
Consider an application that does many memory allocations via malloc/new
(brk/sbrk). Each allocation is smaller than the page size. The
application frees most of the allocated blocks, but some of the later-allocated
blocks are still in use, so memory consumption remains high -
a typical case of memory fragmentation.
It sounds like a pool allocator may solve your problem nicely,
especially if you can separate "blocks that are likely to be free()d
soon" from "blocks that are likely to persist" into different pools.
Post by G. Maly
A possible solution could be to use a malloc-replacement library. Do any
of these libraries internally call, for example, mmap instead of brk, even
for small blocks?
Unless you are allocating a small number of small blocks (which can't
be the case -- you wouldn't have had the problem then), the
above solution will likely exhaust your VM long before you run
into the "fragmentation" problem. Also note that the total number
of distinct mmap()s is likely limited by the OS.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
G. Maly
2007-07-31 08:13:56 UTC
Post by Paul Pluzhnikov
Post by G. Maly
Consider an application that does many memory allocations via malloc/new
(brk/sbrk). Each allocation is smaller than the page size. The
application frees most of the allocated blocks, but some of the later-allocated
blocks are still in use, so memory consumption remains high -
a typical case of memory fragmentation.
It sounds like a pool allocator may solve your problem nicely,
especially if you can separate "blocks that are likely to be free()d
soon" from "blocks that are likely to persist" into different pools.
The application is large and complicated, so redesigning its memory
management would be difficult. Simply replacing the malloc library with a
more suitable one seems to be a cheaper solution.
Post by Paul Pluzhnikov
Post by G. Maly
A possible solution could be to use a malloc-replacement library. Do any
of these libraries internally call, for example, mmap instead of brk,
even for small blocks?
Unless you are allocating a small number of small blocks (which can't
be the case -- you wouldn't have had the problem then), the
above solution will likely exhaust your VM long before you run
into the "fragmentation" problem. Also note that the total number
of distinct mmap()s is likely limited by the OS.
I've tested it on Linux-64, and the number of mmap()s can in any case be
more than 1,000,000 :) That should be enough for most
applications with a reasonable malloc implementation, e.g. one where multiple
malloc calls get memory from the same mmap'd block.
Paul Pluzhnikov
2007-07-31 14:54:45 UTC
Post by G. Maly
Post by Paul Pluzhnikov
Also note that the total number
of distinct mmap()s is likely limited by the OS.
I've tested it on Linux-64, and the number of mmap()s can in any case be
more than 1,000,000 :)
Is that supposed to be a large number?

What was the "high water mark" for simultaneous allocations by the
time you hit the "memory fragmentation" problem?

I am guessing it's significantly more than 1M.
Post by G. Maly
That should be enough for most
applications with a reasonable malloc implementation, e.g. one where multiple
malloc calls get memory from the same mmap'd block.
If multiple malloc()s get memory from the same mmap()ed block,
then you are right back to where you started (that's what glibc
already does, except from a large sbrk()d block) -- if *some*
of these malloc()ed blocks are later free()d, but some aren't,
the entire mmap()ed block can't be munmap()ed, and you are fragmented
again.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
G. Maly
2007-07-31 18:17:01 UTC
Post by Paul Pluzhnikov
Post by G. Maly
Post by Paul Pluzhnikov
Also note that the total number
of distinct mmap()s is likely limited by the OS.
I've tested it on Linux-64, and the number of mmap()s can in any case be
more than 1,000,000 :)
Is that supposed to be a large number?
If one mmap block is 128K, then 1 million such blocks equal 128 GB. That
should be enough for the application in my example.
Post by Paul Pluzhnikov
If multiple malloc()s get memory from the same mmap()ed block,
then you are right back to where you started (that's what glibc
already does, except from a large sbrk()d block)
No, glibc doesn't do that. It doesn't try to place more than one
malloc allocation in one mmap.
Large blocks aren't sbrk()d; they are mmap()d if you allocate memory
via malloc. "Large" means 128K (or 256K?) in the standard
malloc implementation.
Post by Paul Pluzhnikov
-- if *some*
of these malloc()ed blocks are later free()d, but some aren't,
the entire mmap()ed block can't be munmap()ed, and you are fragmented
again.
The probability that a given (let's say <=128K) mmap block will be
fragmented and cannot be munmap()ed should be less than the probability
that brk()d arenas of many GBs can't be "freed" due to a few still-allocated
objects, as happens in the standard implementation now.

Well, my question is not "how does one write a good malloc implementation?
mmap or brk?"
but
"is there any alternative malloc implementation which could help to avoid
fragmentation on Linux?" E.g. in this code snippet, just one (the last)
allocated block prevents the process's memory consumption from shrinking,
although the other 999,999 blocks have been deleted.

const int n = 1000000;
const int sz = 1024;
static char* blocks[n];   // static: 8 MB of pointers would overflow the stack
for (int i = 0; i < n; ++i)
blocks[i] = new char[sz];
for (int i = 0; i < n - 1; ++i)   // delete all blocks except the last one
delete[] blocks[i];               // delete[] matches new[]
Paul Pluzhnikov
2007-08-01 05:38:49 UTC
Post by G. Maly
"is there any alternative malloc-implementation which could help to avoid
fragmentation on linux?" e.g. in this code snippet, where just one (latest)
allocated block prevents memory consumtion reduction of the process although
the other 999999 blocks have been deleted.
You could try libumem (originally used in the Solaris kernel, then
ported to Solaris user-space, and finally to other systems):
http://labs.omniti.com/trac/portableumem

I do know how stable the Linux port is.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
Paul Pluzhnikov
2007-08-01 05:42:44 UTC
Post by Paul Pluzhnikov
I do know how stable the Linux port is.
s/do know/do not know/

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
G. Maly
2007-08-01 08:40:18 UTC
Post by Paul Pluzhnikov
Post by G. Maly
"is there any alternative malloc-implementation which could help to avoid
fragmentation on linux?" e.g. in this code snippet, where just one (latest)
allocated block prevents memory consumtion reduction of the process although
the other 999999 blocks have been deleted.
You could try libumem (originally used in solaris kernel, then
http://labs.omniti.com/trac/portableumem
"...provides faster and more efficient memory allocation by using an object
caching strategy..."

It seems to be optimized for speed. umem will assume that the 999,999
freed blocks will be used again soon and cache them rather than give them
back to the OS, won't it?
Regards,
Gennady
Giorgos Keramidas
2007-08-01 09:08:18 UTC
Post by G. Maly
Post by Paul Pluzhnikov
Post by G. Maly
"is there any alternative malloc-implementation which could help to
avoid fragmentation on linux?" e.g. in this code snippet, where just
one (latest) allocated block prevents memory consumtion reduction of
the process although the other 999999 blocks have been deleted.
You could try libumem (originally used in solaris kernel, then
http://labs.omniti.com/trac/portableumem
"...provides faster and more efficient memory allocation by using an
object caching strategy..."
It seems to be optimized for speed. umem will assume that the
999,999 freed blocks will be used again soon and cache them rather than
give them back to the OS, won't it?
libumem is a bit more complex than the description above.

There is an integrated 'cache reaping' mechanism, which runs either on a
time-based interval or when the heap needs to be grown. This cache
reaping mechanism tries to free up enough resources to satisfy
allocation requests from already freed (but cached) objects.

So libumem won't give freed (but cached) areas "back to the OS", but it
will happily reuse them to satisfy future allocation requests blazingly
fast.
G. Maly
2007-08-01 09:57:28 UTC
Post by Giorgos Keramidas
Post by G. Maly
It seems to be optimized for speed. umem will assume that the
999,999 freed blocks will be used again soon and cache them rather than
give them back to the OS, won't it?
libumem is a bit more complex than the description above.
There is an integrated 'cache reaping' mechanism, which runs either on a
time-based interval or when the heap needs to be grown. This cache
reaping mechanism tries to free up enough resources to satisfy
allocation requests from already freed (but cached) objects.
So libumem won't give freed (but cached) areas "back to the OS", but it
will happily reuse them to satisfy future allocation requests blazingly
fast.
It will not happily reuse them if the application does NOT need such a large
amount of memory any more, or at least not for a while.
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.

It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Rainer Weikusat
2007-08-01 11:01:58 UTC
Post by G. Maly
Post by Giorgos Keramidas
Post by G. Maly
It seems to be optimized for speed. umem will assume that the
999,999 freed blocks will be used again soon and cache them rather than
give them back to the OS, won't it?
libumem is a bit more complex than the description above.
There is an integrated 'cache reaping' mechanism, which runs either on a
time-based interval or when the heap needs to be grown. This cache
reaping mechanism tries to free up enough resources to satisfy
allocation requests from already freed (but cached) objects.
So libumem won't give freed (but cached) areas "back to the OS", but it
will happily reuse them to satisfy future allocation requests blazingly
fast.
It will not happily reuse them if the application does NOT need such a large
amount of memory any more, or at least not for a while.
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Ehh ... you can write a basic slab cache implementation (without support
for multithreading) in 139 lines of code (that's the smaller of the
two I have done so far). Additionally, you will need a really simple
'page allocator' that gets 'free pages' from the kernel via mmap. If
this allocation scheme suits you (i.e. you are allocating 'large
quantities' of objects with 'few' different and fixed sizes), that
should be easy enough to do.
G. Maly
2007-08-01 14:35:25 UTC
Post by Rainer Weikusat
Post by G. Maly
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Ehh ... you can write a basic slab cache implementation (without support
for multithreading)
I need support for multithreading
Post by Rainer Weikusat
in 139 lines of code (that's the smaller of the
two I have done so far). Additionally, you will need a really simple
'page allocator' that gets 'free pages' from the kernel via mmap. If
this allocation scheme suits you (ie you are allocating 'large
quantities' of objects with 'few' different and fixed sizes) that
should be easy enough to do.
I don't know what a 139-liner can do, but as e.g. Spoon wrote in his answer,
the standard malloc implementation can be seen here
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/malloc/malloc.c?rev=1.180&cvsroot=glibc
and it has far more than 139 lines of code. And so does libumem.

I am not sure it is easy to write a high-performance malloc replacement;
that's why I'm looking for already available projects.
Rainer Weikusat
2007-08-02 10:12:53 UTC
Post by G. Maly
Post by Rainer Weikusat
Post by G. Maly
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Ehh ... you can write a basic slab cache implementation (without support
for multithreading)
I need support for multithreading
Then include it, perhaps?
Post by G. Maly
Post by Rainer Weikusat
in 139 lines of code (that's the smaller of the
two I have done so far). Additionally, you will need a really simple
'page allocator' that gets 'free pages' from the kernel via mmap. If
this allocation scheme suits you (ie you are allocating 'large
quantities' of objects with 'few' different and fixed sizes) that
should be easy enough to do.
I don't know what a 139-liner can do,
I told you in the paragraph above: it is an implementation of the 'v1
slab allocator', as described in the original Bonwick paper, but
without the per-cache mutex.
Post by G. Maly
but as e.g. Spoon wrote in his answer,
the standard malloc implementation can be seen here
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/malloc/malloc.c?rev=1.180&cvsroot=glibc
and it has far more than 139 lines of code.
There is no such thing as 'the standard malloc implementation'. The link
above presumably refers to the ptmalloc implementation distributed
with glibc, which is a dlmalloc modification. 'malloc' implementations
have a tendency to be 'fairly large' because they attempt to solve a
problem which has no known theoretical solution. At least I am fairly
convinced that this is actually a blind alley, meaning there will
never be a solution.
Post by G. Maly
I am not sure it is easy to write a high-performance malloc
replacement,
I'd suggest that you either try to familiarize yourself with the available
literature on this topic (e.g. for the 'use case' of slab caches, it is
really easy) or trust people who have.
Giorgos Keramidas
2007-08-01 19:32:54 UTC
Post by G. Maly
Post by Giorgos Keramidas
Post by G. Maly
It seems to be optimized for speed. umem will assume that the
999,999 freed blocks will be used again soon and cache them rather than
give them back to the OS, won't it?
libumem is a bit more complex than the description above.
There is an integrated 'cache reaping' mechanism, which runs either on a
time-based interval or when the heap needs to be grown. This cache
reaping mechanism tries to free up enough resources to satisfy
allocation requests from already freed (but cached) objects.
So libumem won't give freed (but cached) areas "back to the OS", but it
will happily reuse them to satisfy future allocation requests blazingly
fast.
It will not happily reuse them if the application does NOT need such
a large amount of memory any more, or at least not for a while.
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Since cache reaping is not based only on memory demand/pressure but it
is also time-triggered after a configurable period of seconds, libumem
*will* return unused memory from its cache after the time-based reap
happens.

Doesn't that cover the (1) and (2) phases described above?
G. Maly
2007-08-02 10:25:30 UTC
Post by Giorgos Keramidas
Post by G. Maly
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Since cache reaping is not based only on memory demand/pressure but it
is also time-triggered after a configurable period of seconds, libumem
*will* return unused memory from its cache after the time-based reap
happens.
Hello Giorgos,
How can I set the configurable period of seconds?
Will unused memory be returned even with plain malloc()/free(), or only if
the libumem-specific umem_alloc() etc. are called in the application?

I've tested with LD_PRELOAD="libumem.so" and couldn't notice any
improvement compared to glibc.

Regards,
Gennady
Giorgos Keramidas
2007-08-02 12:40:16 UTC
Post by G. Maly
Post by Giorgos Keramidas
Post by G. Maly
Assume an application which runs
1) short phases with large memory consumption and
2) long phases with small memory consumption.
It would be preferable if virtual memory were reduced during phase 2.
The question was: which memory allocator supports this?
Since cache reaping is not based only on memory demand/pressure but it
is also time-triggered after a configurable period of seconds, libumem
*will* return unused memory from its cache after the time-based reap
happens.
Hello Giorgos,
How can I set the configurable period of seconds?
Will unused memory be returned even with plain malloc()/free(), or only if
the libumem-specific umem_alloc() etc. are called in the application?
'Unused memory' in this case is memory which you have allocated with
malloc(), released back to the C allocator with free(), and libumem has
cached for later use.
Post by G. Maly
I've tested with LD_PRELOAD="libumem.so" and couldn't notice any
improvement compared to glibc.
What would you consider an 'improvement'? How did you measure the
differences between glibc and libumem?

Note that to get the best out of libumem, you may have to set one or
more libumem options. The public API of the library is documented in
the manpages:

libumem (3lib) - object-caching memory allocation library
umem_alloc (3malloc) - fast, scalable memory allocation
umem_cache_create (3malloc) - allocation cache manipulation
umem_debug (3malloc) - debugging features of the umem library
G. Maly
2007-08-02 14:57:59 UTC
Post by Giorgos Keramidas
'Unused memory' in this case is memory which you have allocated with
malloc(), released back to the C allocator with free(), and libumem has
cached for later use.
Post by G. Maly
I've tested with LD_PRELOAD="libumem.so" and couldn't notice any
improvement compared to glibc.
What would you consider an 'improvement'? How did you measure the
differences between glibc and libumem?
During the libumem installation two shared libs were created: libumem.so and
libumem_malloc.so.
As I am not sure which of them is the right one for malloc replacement, I
tried them both.

LD_PRELOAD="libumem_malloc.so" caused the application to abort in the first
malloc() call.

LD_PRELOAD="libumem.so" works, but memory consumption is the same as with
glibc. Possibly libumem.so is not a malloc-replacement lib at all, but a
debugging tool for memory leaks?

I've tested with UMEM_OPTIONS=backend=mmap and =sbrk as well.
Rainer Weikusat
2007-08-02 16:17:30 UTC
"G. Maly" <***@compuserve.com> writes:

[...]
Post by G. Maly
LD_PRELOAD="libumem.so" works, but memory consumption is the same as with
glibc. Possibly libumem.so is not a malloc-replacement lib at all, but a
debugging tool for memory leaks?
Is there a particular reason why you are so extremely fond of making a
fool of yourself in public? 'libumem' is the userspace version of Jeff
Bonwick's 'slab allocator'. For fairly obvious reasons, replacing the
allocator does not cause a program using an allocator to allocate less
memory.

Link to the original 'slab allocator' paper:

<URL:http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html>

'v2', including a description of 'libumem':

<URL:http://www.usenix.org/event/usenix01/bonwick.html>
