Discussion:
GNU malloc and external fragmentation
(too old to reply)
Rainer Weikusat
2021-07-13 19:18:20 UTC
Permalink
I'm currently working on something supposed to enable matching names in
DNS queries against a large "list of bad domains you don't want to be
talking to". The test file I'm working with has 1,118,228
entries. Reading all of the information in it into memory needs a little
over 34M of RAM. A program using calloc to allocate the necessary
storage ends up with an >49M heap due to memory wasted by GNU malloc on
$something.

In contrast to this, using mmap to allocate an area of a suitable size
and putting the structures and domain names there just wastes 2400
bytes.

The code will possibly also run on a modern "embedded system"/ SOHO
router. While these are pretty powerful nowadays, 15M of RAM getting
sucked into malloc for no particular purpose seems excessive to
me.
Marcel Mueller
2021-07-13 20:46:37 UTC
Permalink
Post by Rainer Weikusat
I'm currently working on something supposed to enable matching names in
DNS queries against a large "list of bad domains you don't want to be
talking to". The test file I'm working with has 1,118,228
entries. Reading all of the information in it into memory needs a little
over 34M of RAM.
Too much for reasonable RAM cache efficiencies. You should consider more
efficient data structures.

I wrote something like that about 25 years ago (domain & URL blacklists
& whitelists for security critical environments).
Trie (not tree!) structures could significantly reduce the amount of
memory and avoid too much random access to the memory. You could use
single characters or even domain components like the TLD as atoms in the
trie.
Furthermore tree lookups have high cache efficiency (compared to hash
tables) since every search starts at the same root.
In fact you need hybrid structures that decay to ordinary B trees at
deeper levels to avoid large overhead on long names.

Trie and B tree structures can also be combined. E.g. you could use
chunks of 4 bytes as trie atoms, e.g. ".com" or "n.de" in the top level
This 32 bit values could be directly stored in the B tree nodes used for
dispatch tables. With a node size of 63 or 127 you could get only a few
bits overhead per B tree entry, since an 8 bit value is sufficient for
node size and node type (node/leaf). And redundant trailing domain parts
are still deduplicated due to the trie structure. The next level
contains only the previous 4 bytes of the domain name.
Post by Rainer Weikusat
A program using calloc to allocate the necessary
storage ends up with an >49M heap due to memory wasted by GNU malloc on
$something.
Do you allocate the storage as a single chunk? (not recommended)
Or did you allocate every string individually? Then you get a
considerable overhead.
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable size
and putting the structures and domain names there just wastes 2400
bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
Post by Rainer Weikusat
The code will possibly also run on a modern "embedded system"/ SOHO
router.
Then you should /really/ optimize the data structures. The CPUs of these
systems are usually not that fast. And even small delays of DNS queries
are annoying.
Post by Rainer Weikusat
While these are pretty powerful nowadays, 15M of RAM getting
sucked into malloc for no particular purpose seems excessive to
me.
34MB for a domain blacklist on an embedded device sounds unreasonably to
me too. At least you should ensure that only a small part of this memory
is "hot", i.e. used for most of the comparisons for performance reasons.


Marcel
M***@r7hwq328lx.tv
2021-07-14 07:24:29 UTC
Permalink
On Tue, 13 Jul 2021 22:46:37 +0200
Post by Marcel Mueller
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable size
and putting the structures and domain names there just wastes 2400
bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
I used mmap on Linux in a compression utility recently that had to scan the
same file a number of times because I assumed it would be faster than normal
I/O. Turned out I was wrong and in fact using open(), read() etc was about
twice as fast which I still find confusing because I thought all the standard
C I/O used mmap (or its kernel hook) eventually anyway. Strange.
Richard Kettlewell
2021-07-14 07:32:42 UTC
Permalink
Post by M***@r7hwq328lx.tv
Post by Marcel Mueller
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable
size and putting the structures and domain names there just wastes
2400 bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
I used mmap on Linux in a compression utility recently that had to
scan the same file a number of times because I assumed it would be
faster than normal I/O. Turned out I was wrong and in fact using
open(), read() etc was about twice as fast which I still find
confusing because I thought all the standard C I/O used mmap (or its
kernel hook) eventually anyway. Strange.
I did a similar experiment a couple of decades ago, with similar
results.

The reason, as I understand it, is that changing your process’s virtual
memory mapping is relatively expensive, even compared to the copying
involved in read().
--
https://www.greenend.org.uk/rjk/
M***@swzygk2or5gveo8v.gov
2021-07-14 08:16:52 UTC
Permalink
On Wed, 14 Jul 2021 08:32:42 +0100
Post by Richard Kettlewell
Post by M***@r7hwq328lx.tv
Post by Marcel Mueller
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable
size and putting the structures and domain names there just wastes
2400 bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
I used mmap on Linux in a compression utility recently that had to
scan the same file a number of times because I assumed it would be
faster than normal I/O. Turned out I was wrong and in fact using
open(), read() etc was about twice as fast which I still find
confusing because I thought all the standard C I/O used mmap (or its
kernel hook) eventually anyway. Strange.
I did a similar experiment a couple of decades ago, with similar
results.
The reason, as I understand it, is that changing your process’s virtual
memory mapping is relatively expensive, even compared to the copying
involved in read().
Ah ok, that makes sense but its a bit of a shame.
James K. Lowden
2021-07-15 16:38:19 UTC
Permalink
On Wed, 14 Jul 2021 08:32:42 +0100
Post by Richard Kettlewell
Post by M***@r7hwq328lx.tv
I used mmap on Linux in a compression utility recently that had to
scan the same file a number of times because I assumed it would be
faster than normal I/O. Turned out I was wrong and in fact using
open(), read() etc was about twice as fast which I still find
confusing because I thought all the standard C I/O used mmap (or its
kernel hook) eventually anyway. Strange.
I did a similar experiment a couple of decades ago, with similar
results.
The reason, as I understand it, is that changing your process?s
virtual memory mapping is relatively expensive, even compared to the
copying involved in read().
My working hypothesis is that the cost of rearranging the process's
memory exceeds the cost of copying the data to existing memory N times,
where N = 1. For some value of N, the rearrangement cost is
amortized, and total I/O costs minimized using mmap.

I'm relying on empirical evidence. Linux uses mmap for executables, so
it can't be terrible. And LMDB has had great success with performance
using memory-mapped stores.

--jkl
Marcel Mueller
2021-07-15 05:37:42 UTC
Permalink
Post by M***@r7hwq328lx.tv
On Tue, 13 Jul 2021 22:46:37 +0200
Post by Marcel Mueller
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable size
and putting the structures and domain names there just wastes 2400
bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
I used mmap on Linux in a compression utility recently that had to scan the
same file a number of times because I assumed it would be faster than normal
I/O. Turned out I was wrong and in fact using open(), read() etc was about
twice as fast which I still find confusing because I thought all the standard
C I/O used mmap (or its kernel hook) eventually anyway. Strange.
The problem with mmap is, that it does not read anything. You will get a
page fault every 4kB and so the Data is read in 4k Chunks, allocating
physical memory in the same chunk size. The latter could cause
significant fragmentation of the physical memory leading to many TLB
entries. And the latter are a limited resource. If your thread needs
more that the hardware provides you will get further page faults.
Reading data in 4k chunks was inefficient 30 years ago and is even more
inefficient nowadays. Page faults are expensive too. You loose in the
order of 10³ clock cycles.
In contrast you probably did not use read with only 4kb chunks. But even
if you did so the file system cache has optimizations for sequential
read and uses read ahead.


Marcel
M***@7a31.net
2021-07-15 08:35:33 UTC
Permalink
On Thu, 15 Jul 2021 07:37:42 +0200
Post by Marcel Mueller
In contrast you probably did not use read with only 4kb chunks. But even
if you did so the file system cache has optimizations for sequential
read and uses read ahead.
In that case does anyone know how read() etc on Linux actually access the
file information? Seems like I was wrong about the I/O library mapping the
file (unless it does it in chunks) so are there other kernel hooks it
uses?
Scott Lurndal
2021-07-15 14:05:40 UTC
Permalink
Post by M***@7a31.net
On Thu, 15 Jul 2021 07:37:42 +0200
Post by Marcel Mueller
In contrast you probably did not use read with only 4kb chunks. But even
if you did so the file system cache has optimizations for sequential
read and uses read ahead.
In that case does anyone know how read() etc on Linux actually access the
file information? Seems like I was wrong about the I/O library mapping the
file (unless it does it in chunks) so are there other kernel hooks it
uses?
In the normal case, read() will issue a request through the I/O subsystem
which will find its way to the filesystem-specific code, which will
read the 4k (on intel, 4/16/64k on armv8) block containing the required
data into an unused page of memory. Subsequent reads will be be satified
from that page until the read crosses a page boundary, in which case the
kernel will read the next page of data from the file and the read will
be satisfied from that. This is generally known as the page cache,
the size of which varies from a configured minimum to as much memory
as is currently not used for any other purpose. Pages that
exceed the configured minimum will be reclaimed if needed for
e.g. fork() of a new process, with those in the configured minimum
subject to a replacement algorithm (e.g. LRU).

O_DIRECT can be used to bypass the page cache and read data directly
into the application address space.
Richard Kettlewell
2021-07-15 16:09:54 UTC
Permalink
Post by M***@7a31.net
Post by Marcel Mueller
In contrast you probably did not use read with only 4kb chunks. But even
if you did so the file system cache has optimizations for sequential
read and uses read ahead.
In that case does anyone know how read() etc on Linux actually access the
file information? Seems like I was wrong about the I/O library mapping the
file (unless it does it in chunks) so are there other kernel hooks it
uses?
Scott has given you most of the answer. The connection to memory mapping
is that when you memory map (part of) a file, the physical RAM used for
that mapping is same RAM that is used for the kernel’s page cache for
the corresponding part of the file (with some caveats about private
mappings).
--
https://www.greenend.org.uk/rjk/
Scott Lurndal
2021-07-15 17:22:10 UTC
Permalink
Post by Richard Kettlewell
Post by M***@7a31.net
Post by Marcel Mueller
In contrast you probably did not use read with only 4kb chunks. But even
if you did so the file system cache has optimizations for sequential
read and uses read ahead.
In that case does anyone know how read() etc on Linux actually access the
file information? Seems like I was wrong about the I/O library mapping the
file (unless it does it in chunks) so are there other kernel hooks it
uses?
Scott has given you most of the answer. The connection to memory mapping
is that when you memory map (part of) a file, the physical RAM used for
that mapping is same RAM that is used for the kernel’s page cache for
the corresponding part of the file (with some caveats about private
mappings).
And some caveats about huge pages. The linux mmap() allows the
application to specify the use of legacy huge pages (2m on x86_64, 4m on ia32)
for the mapping, in which case the mapping uses physical memory reserved
at boot time. (TLB entries for huge pages must be aligned to the page
size, which makes creating a properly aligned huge page out of a 4k
pool of pages at run time a very difficult (and costly) proposition).

More recently, linux added THP (Transparent Huge Pages) which will
opportunistically use huge pages when circumstances allow.

Using large pages when appropriate can significantly improve the
TLB hit rate (table walks, particularly in Virtual Machines) can
be quite expensive.

Scott Lurndal
2021-07-15 14:00:02 UTC
Permalink
Post by Marcel Mueller
Post by M***@r7hwq328lx.tv
On Tue, 13 Jul 2021 22:46:37 +0200
Post by Marcel Mueller
Post by Rainer Weikusat
In contrast to this, using mmap to allocate an area of a suitable size
and putting the structures and domain names there just wastes 2400
bytes.
mmap operates directly on MMU-Level, typically allocating 4k Pages.
It is efficient if it is not used too often or for too small chunks.
I used mmap on Linux in a compression utility recently that had to scan the
same file a number of times because I assumed it would be faster than normal
I/O. Turned out I was wrong and in fact using open(), read() etc was about
twice as fast which I still find confusing because I thought all the standard
C I/O used mmap (or its kernel hook) eventually anyway. Strange.
The problem with mmap is, that it does not read anything. You will get a
page fault every 4kB and so the Data is read in 4k Chunks, allocating
physical memory in the same chunk size.
(1) the underlying page size varies by processor (e.g. 4k, 16k or 64k on
ARMv8) and arm/intel/amd processors can configure larger block
sizes that will consume a single TLB entry per block (4M on 32-bit x86,
2M/1G on 64-bit x86_64, and a wide range of sizes on ARMv8).

And mmap on linux allows one to map using large pages.

(2) one can use madvise to tell the kernel that the access
is sequential allowing the kernel to both pre-read successive
pages before the page fault and invalidate the TLB entry for
the prior page.

(3) The transparent Huge Pages feature on linux will automatically
use large pages when conditions allow.
The latter could cause
Post by Marcel Mueller
significant fragmentation of the physical memory leading to many TLB
entries.
The size of the TLB(s) is fixed. There are so many 4K I/DTLBs
associated with the L1 cache, and so many 4k/2M/4M/1G TLB entries
associated with the L2 cache. They're filled on first reference
and replaced typically using a form of LRU replacement algorithm.

Check processor specific documentation for the size of the TLBs on
each processor.

The number of pages used for the mmap is irrelevent with respect
to TLB pressure - the access pattern is far more relevent.
Tavis Ormandy
2021-07-13 21:14:51 UTC
Permalink
Post by Rainer Weikusat
I'm currently working on something supposed to enable matching names in
DNS queries against a large "list of bad domains you don't want to be
talking to". The test file I'm working with has 1,118,228
entries. Reading all of the information in it into memory needs a little
over 34M of RAM. A program using calloc to allocate the necessary
storage ends up with an >49M heap due to memory wasted by GNU malloc on
$something.
You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?

If it's the former, it sounds about right. I don't think this is
fragmentation, just the allocator chunk metadata overhead.

Each chunk needs some overhead to hold the size/flags and prev size. It
seems like each entry is about 31 bytes (34M / 1118228), and you say
(49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead,
probably the two size_t's on x64.

https://sourceware.org/glibc/wiki/MallocInternals#What_is_a_Chunk.3F

Tavis.
--
_o) $ lynx lock.cmpxchg8b.com
/\\ _o) _o) $ finger ***@sdf.org
_\_V _( ) _( ) @taviso
Nicolas George
2021-07-14 09:55:01 UTC
Permalink
Post by Tavis Ormandy
If it's the former, it sounds about right. I don't think this is
fragmentation, just the allocator chunk metadata overhead.
Each chunk needs some overhead to hold the size/flags and prev size. It
seems like each entry is about 31 bytes (34M / 1118228), and you say
(49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead,
probably the two size_t's on x64.
Almost exactly. A few checks show that GNU malloc has a 8 chars = 1 size_t
overhead on x86_64, but it guarantees an alignment of 16.

That means if you allocate between 25 and 40 chars inclusive, the overhead
adds to between 33 and 48, and it gets rounded up to 48.
Rainer Weikusat
2021-07-14 14:11:16 UTC
Permalink
Post by Tavis Ormandy
Post by Rainer Weikusat
I'm currently working on something supposed to enable matching names in
DNS queries against a large "list of bad domains you don't want to be
talking to". The test file I'm working with has 1,118,228
entries. Reading all of the information in it into memory needs a little
over 34M of RAM. A program using calloc to allocate the necessary
storage ends up with an >49M heap due to memory wasted by GNU malloc on
$something.
You mean each entry is a calloc(), or you do calloc(1118228, sizeof(entry))?
If it's the former, it sounds about right. I don't think this is
fragmentation, just the allocator chunk metadata overhead.
Each chunk needs some overhead to hold the size/flags and prev size. It
seems like each entry is about 31 bytes (34M / 1118228), and you say
(49M / 1118228) = ~45 bytes are used, so that's ~14 bytes of overhead,
probably the two size_t's on x64.
The entries are of different sizes although 32 byte is the by far most
common one. Judging from

When a chunk is in use by the application, the only data that's
"remembered" is the size of the chunk.

there should be an 8 byte overhead per chunk which accounts for about
half of the memory overhead.
Scott Lurndal
2021-07-14 15:02:06 UTC
Permalink
Post by Rainer Weikusat
I'm currently working on something supposed to enable matching names in
DNS queries against a large "list of bad domains you don't want to be
talking to". The test file I'm working with has 1,118,228
entries. Reading all of the information in it into memory needs a little
over 34M of RAM. A program using calloc to allocate the necessary
storage ends up with an >49M heap due to memory wasted by GNU malloc on
$something.
Seems like a perfect application for a tree structure by domain
component. . at the top, then TLD, then subs etc.

. -> com -> sun -> www

etc.


Lookups will be quick as well.
Loading...