As to why they're not always called directly, imagine some code like this:
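
    #include <stddef.h>  /* NULL */

    int FooWithoutChecks(void *p);  /* assumes p is non-null */

    int Foo(void *p) {
        if (p == NULL) return -1;    /* reject null before doing real work */
        return FooWithoutChecks(p);  /* tail call */
    }
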
In general, callers are expected to call Foo if they aren't sure whether the pointer might be null. If they already know it is not null (e.g. because they checked it themselves), they can call FooWithoutChecks and skip a null check that they know will never fire.
The naive way to emit assembly for this is to emit two separate functions and have Foo call FooWithoutChecks the usual way. But notice that the call to FooWithoutChecks is a tail call, so the compiler can apply tail call optimization and go one step further: it places the body of FooWithoutChecks directly inside Foo, right after the null check, so Foo simply falls through into it. This is nice because calling Foo no longer involves an inner call/ret pair, so you save two instructions on every call to Foo. But what if someone calls FooWithoutChecks? Simple: they call the address inside Foo just past the pointer comparison. This works because the shared body already ends in a ret, which returns to whichever caller entered it. The optimization also saves some space in the binary, which has benefits of its own (less instruction-cache pressure, for example).
The example here with the null pointer check is kind of contrived, but this kind of pattern happens a LOT in real code when you have a small wrapper function that does a tail call to another function, and isn't specific to pointer checks.
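To make the "not specific to pointer checks" point concrete, here is a minimal sketch of the same shape with no null check at all. The names sum_range and sum_all are made up for illustration; the only point is that the wrapper's sole job is to fill in an argument and tail-call the worker, which is the situation where a compiler may emit a jump or a shared body instead of a real call.

```c
#include <stddef.h>

/* Hypothetical worker: sums n elements of values, beginning at index start. */
long sum_range(const int *values, size_t start, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[start + i];
    return total;
}

/* Thin wrapper that only supplies a default for start. The call is in
 * tail position, so an optimizing compiler is free to lower it to a jump
 * (or inline it) rather than emit a call/ret pair. */
long sum_all(const int *values, size_t n)
{
    return sum_range(values, 0, n);
}
```

Whether a given compiler actually merges the two entry points depends on the optimization level and target, so treat this as an illustration of the source pattern rather than a guarantee about the generated code.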
> Why should every function start with endbr64 command? Aren't functions usually called directly?
They're usually called directly, but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required.
> Also, is it required to insert endbr64 command after function calls (for return address)?
No, IBT only covers indirect jmp and call. The shadow stack (SS) is the equivalent mechanism for ret.
> but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required
Then why not just have the compiler break down every non-static function into two blocks: a static function that contains all the logic, and a non-static function that contains only an endbr64 and a direct jump to the static function? (Or, better yet, place the non-static label just before the static one, and have the non-static entry fall through into the body of the static one.) Then the static direct call sites won't have to pay the overhead of executing the IBT NOP.
The IBT NOP is "free" in that it will evaporate in the pipeline; it still has to be fetched and decoded to some extent, but it does not consume execution resources.
From a tooling perspective, what you're describing (two entry points for one function; the jump variant you mention is pointless) would require changes up and down the toolchain: it would affect the compiler, all linkers, all debuggers, etc. By contrast, just adding one more instruction to the function prologue is relatively low-impact.
It's also worth noting that at the time code for a function is emitted, the compiler is not aware of whether the symbol will be exported and thus discoverable in some other module, or by symbol table lookup, so emitting the target instruction is essentially mandatory.
Doesn't seem like it'd be that difficult to make the change the other direction, i.e. keep endbr64 as-is as the default case, but if there's a direct jump/call to anywhere that starts with endbr64, offset the immediate by 4 bytes; could be done in any single stage of toolchain that has that info with no extra help. But yeah, quite low impact, might not even affect decode throughput & cache usage for at least one of the direct or indirect cases.
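A rough sketch of that idea, assuming a tool that can see the raw code bytes and the resolved target of a direct call. The helper name skip_endbr64 and the flat byte-buffer model are hypothetical; the only hard fact used is the 4-byte endbr64 encoding, F3 0F 1E FA.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Machine encoding of endbr64: F3 0F 1E FA (4 bytes). */
static const uint8_t ENDBR64[4] = { 0xF3, 0x0F, 0x1E, 0xFA };

/* Hypothetical helper: given a code image and the offset a direct
 * call/jmp currently targets, return the offset it should use instead.
 * If the target starts with endbr64, skip past it; otherwise leave the
 * target unchanged. */
size_t skip_endbr64(const uint8_t *code, size_t code_len, size_t target)
{
    if (target <= code_len &&
        code_len - target >= sizeof ENDBR64 &&
        memcmp(code + target, ENDBR64, sizeof ENDBR64) == 0)
        return target + sizeof ENDBR64;
    return target;
}
```

A real implementation would also have to patch the rel32 immediate of the call and make sure the bytes it matched really are an instruction boundary, which this sketch does not attempt.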
That's absolutely doable, just... How much slower or faster is a predicted unconditional jump than ENDBR64? What's the ratio of indirect to direct calls in real-world programs? And while your last proposal ("foo: endbr64; foo_internal: <code>") evades those questions, it raises questions about maintaining function alignment (16 bytes IIRC? Is this even necessary today?) and restructuring the compiler to distinguish the inner and external symbol addresses. Plus, of course, somebody has to actually sit down and write the code to implement that, as opposed to just adding "if (func->is_escaping) emit_endbr(...);" at the beginning of the code that emits the object code for a function body.
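For a sense of how small the "just add an if" version is, here is a toy emitter built around that one-liner. The struct func layout, the emit_function name, and the byte-buffer output are all invented for illustration and are not taken from any real compiler; only the endbr64 encoding (F3 0F 1E FA) is real.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical compiler-side description of a function. */
struct func {
    bool is_escaping;     /* address may be taken or symbol may be visible elsewhere */
    const uint8_t *body;  /* already-encoded machine code for the body */
    size_t body_len;
};

/* Writes the function into out and returns the number of bytes written,
 * or 0 if the buffer is too small. An endbr64 is prepended only when the
 * function may be the target of an indirect call. */
size_t emit_function(const struct func *f, uint8_t *out, size_t out_cap)
{
    static const uint8_t endbr64[4] = { 0xF3, 0x0F, 0x1E, 0xFA };
    size_t prefix = f->is_escaping ? sizeof endbr64 : 0;

    if (out_cap < prefix || out_cap - prefix < f->body_len)
        return 0;
    if (prefix)
        memcpy(out, endbr64, prefix);
    if (f->body_len)
        memcpy(out + prefix, f->body, f->body_len);
    return prefix + f->body_len;
}
```

The two-entry-point alternative would need the emitter to produce two symbols per function and every consumer of those symbols to know which one to use, which is where the extra toolchain work comes from.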
It's not "executed" per se. It consumes space in the cache hierarchy, and a slot in the front-end decoder. It won't ever be issued, but depending on the microarchitecture in question it might result in an issue cycle having less occupancy than it might have had in the case where the subsequent instruction was available.
With that said, the first few instructions of a called function often stall due to stack pointer dependencies, etc. so the true execution cost is likely to be even smaller than the above might suggest.
C allows for any function to be called via a function pointer, and functions can be in different translation units, so the compiler can't simply assume that a function will never be called indirectly and has to pessimistically insert endbr64 in order to maintain a reasonable ABI.
And no, as I understand it, this applies only to indirect branches and calls, not returns.
Well, if the function is marked "static", the compiler can actually check whether the function's address is taken in the current compilation unit or not and omit/emit ENDBR64 accordingly (passing pointers to static functions to code in other compilation units is legal, and should still work).
Good catch. Yeah, as long as the function's address is never taken the compiler has a lot of leeway with static functions; it can even avoid emitting code for them entirely if it can prove they're never called or if it's able to compute their results at compile-time.
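A small sketch of the two static cases, using qsort as the "code in another compilation unit" that receives a pointer. The function names are made up. With GCC or Clang and -fcf-protection=branch, compare_ints needs a landing pad because qsort calls it indirectly, while a compiler is free to omit one for clamp_nonnegative (or inline it away entirely), since its address never escapes.

```c
#include <stdlib.h>

/* Static, but its address is taken below and handed to qsort, so it can
 * be reached by an indirect call from outside this file. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Static and only ever called directly in this file: the compiler can
 * see every call site and knows no indirect call can land here. */
static int clamp_nonnegative(int v)
{
    return v < 0 ? 0 : v;
}

/* Replaces negative entries with 0, then sorts in ascending order. */
void sort_clamped(int *values, size_t n)
{
    for (size_t i = 0; i < n; i++)
        values[i] = clamp_nonnegative(values[i]);
    qsort(values, n, sizeof values[0], compare_ints);
}
```

Whether the endbr64 is actually dropped for the second function depends on the compiler and version, so checking the disassembly is the only way to be sure for a given build.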