Mario Kart 8 as a primary exploit for homebrew on the Wii U
This blog post gives a rough insight into how the Mario Kart 8 exploit was turned into a primary entrypoint for homebrew. It covers both the technical details and the problems that came up during development. This time I also want to tell you about the ideas that didn’t work, instead of just the one that works fine.
The beginning
At the beginning of this year Rambo6Glaz made yet another implementation of the GX2 kernel exploit, which uses a different PM4 packet to manipulate the kernel heap. This new implementation brought back the idea of implementing a kernel exploit inside a rop chain.
Besides the browser exploit and haxchi there are currently three other userland exploits:
- ROBChain, an exploit in the main character scripting of Super Smash Brothers Wii U
- an exploit in the network protocol of Mario Kart 8
- a savegame exploit in Donkey Kong Tropical Freeze.
But there is a problem with all of these exploits: None of them has access to the JIT-area. This means no access to an area in memory which is writeable and executable. This makes arbitrary code execution without a kernel exploit impossible.
Out of these exploits the Mario Kart 8 one is special. It can be run on a previously unmodified console and could be a potential primary entrypoint into the system. Because of this the focus went to the Mario Kart 8 exploit.
Exploiting the network protocol of Mario Kart 8
Back in 2018 Kinnay found a bug in the P2P protocol of Mario Kart 8. He released a PoC which could crash the console of someone who hosted a friend room, displaying the message “rop chains are fun :)”. This initial implementation allows a (remote!) rop chain execution with a maximum length of ~1000 bytes, more than enough to play around with different payloads.
The original repository has detailed information about the exact bug and exploitation.
In summary it’s possible to achieve a 4 byte arbitrary write due to a bug in parsing the “identification token”. That’s enough to manipulate a vtable, and turn a call of Md5Context::GetHashSize into a memcpy to the stack, effectively copying the content of another packet onto the stack, leading to a rop chain execution.
The kernel exploit in a rop chain - theory
In theory implementing the kernel exploit in a rop chain doesn’t sound that hard. From the wiiuhaxx_common repository we already have rop gadgets we can re-use. This includes for example gadgets to call a function or write a value to an arbitrary address in memory. Detailed information about the kernel exploit can be found in part 4 of my “homebrew environment” blog series, but here is a quick overview:
- Place a fake heap entry into a specific address in memory
- Create a PM4 packet and send it to the GPU to override the “next id” on the kernel heap
- Register an OSDriver and hope it’s allocating memory using our fake heap entry placed in step 1
- Manipulate the “SaveArea” pointer in the OSDriver struct (which is now in userland memory) to point into the kernel data.
- Use the OSDriver_CopyToSaveArea and OSDriver_CopyFromSaveArea functions to get arbitrary read/write with kernel privileges.
This doesn’t really seem that complicated. It’s just a few function calls and it fits relatively easily into a 1000 byte rop chain. We also have the advantage that the address of the stack is consistent. This allows us to place data (like the fake heap entry or the pm4 packet) at the end of the rop chain and simply calculate their positions in memory beforehand. Rambo6Glaz talked about starting to implement the kernel exploit in the Mario Kart 8 exploit, and I thought I would give it a shot too. Using the existing gadgets and already knowing the kernel exploit in detail made me think this would be rather trivial and it would be done in maybe a few hours.
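Because the stack address is fixed, the addresses of data appended to the chain can be computed on the PC before anything is sent. Here is a minimal sketch of that idea in Python. The base address, the gadget words and the blob sizes are made-up placeholders, not the values used in the real exploit.

```python
import struct

# Hypothetical values for illustration only; the real stack address has to be
# taken from crash logs and the real sizes from the actual exploit.
ROP_BASE = 0x20000000   # assumed address where the chain ends up on the stack
MAX_CHAIN_SIZE = 1000   # approximate size limit of the initial rop chain


def layout_chain(gadget_words, blobs):
    """Append data blobs after the gadget words and return (chain, addresses).

    gadget_words: list of 32-bit values (gadget addresses and their arguments)
    blobs: dict mapping a name to the raw bytes placed after the chain
    """
    # PowerPC is big endian, so every word is packed with ">I".
    chain = b"".join(struct.pack(">I", word) for word in gadget_words)
    addresses = {}
    for name, data in blobs.items():
        # Pad so that every blob starts on a 4-byte boundary.
        chain += b"\x00" * (-len(chain) % 4)
        addresses[name] = ROP_BASE + len(chain)
        chain += data
    if len(chain) > MAX_CHAIN_SIZE:
        raise ValueError(f"chain is {len(chain)} bytes, limit is {MAX_CHAIN_SIZE}")
    return chain, addresses


chain, addresses = layout_chain(
    [0x10000000, 0, 0x10000020],  # placeholder gadget words
    {"fake_heap_entry": b"\x00" * 0x10, "pm4_packet": b"\x00" * 0x20},
)
```

In practice the gadget words themselves need these addresses as arguments. Since the number of gadget words is known up front, the layout can simply be computed once and the chain built in a second pass.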
I started to play around with the exploit and tried to implement the kernel exploit step by step. Sometimes I had random crashes and testing was quite annoying. For each try you have to restart the console, go online, open a friend room and send the payload. Then maybe also read the crash log by restarting again and firing up a CFW to access it. In total each attempt took at least 2-3 minutes.
Fast forward a few days. After many hours of testing and trying I still had nothing. But somehow this whole exploit was quite addicting, it started with one simple idea and ended (or didn’t end) every day with “just one more try”. At the same time Rambo6Glaz was doing the same thing. Slowly but steadily we got a better understanding of what’s going on. Eventually we got a working memory write using the kernel exploit, but something was still wrong. It turned out that the exploit was indeed sometimes working (or at least partially working), but only in about 20% of the tries. This made testing even more annoying. Each idea required at least 5 failed attempts to make sure the idea was wrong and it wasn’t just the exploit randomly failing.
At this point we had collected some facts that helped us understand:
- the kernel exploit did sometimes work, but only in rare cases
- the rop chain needed to have a specific length to be stable (otherwise you get really strange behaviour and crashes)
- the rop chain is running on core 2, but the main GX2 core is 1 (the kernel exploit expects to be run on the GX2 core…)
For me personally an unstable exploit was enough, I just wanted to finish this. Even if performing the exploit would require several attempts, I just wanted to see it working once, so I could finally spend my time on other projects.
Because of this I tried to split up the exploit into multiple rop chains which need to be executed one after another. Fitting the kernel exploit in 1000 bytes is doable, but also bundling a real payload and copying/executing it won’t fit anymore. One challenge was to actually restart the game. But from reverse engineering for HID to VPAD I knew there was a function to force opening the home menu (OSSendAppSwitchRequest), and it indeed worked.
I also tried to improve the success rate of the kernel exploit by adding some waiting. But every time I waited via OSSleepTicks or added a GX2DrawDone the console crashed. Knowing the kernel exploit would work only in rare cases, I tried to think of a way to give the user feedback on whether the exploit had succeeded or failed. In a rop chain “code execution” is really limited, it’s only possible to run existing chunks of code. Branches and loops are really hard (at least I haven’t found a way yet to pull it off, I am not a rop chain expert though), so the only option I saw was to manipulate the rop chain itself. I placed an OSFatal at the end of the rop chain to make the console crash, and tried to override it with an OSExitThread using the (hopefully) newly gained kernel write. This way exiting would mean success and crashing would mean failure. Again I spent too much time on this but never really had anything working.
At this point more than a week and literally dozens of hours were already wasted on this, without much progress. It was time to change the strategy. Rambo6Glaz suggested finding a rop gadget that performs a stack pivot, so that we would be able to execute a bigger rop chain.
Rop chain basics
Before working on this I’d been working with rop chains, but I hadn’t found or written any rop gadgets myself. I wasn’t really understanding rop chains, I was just using the “high level” functions from the wiiuhaxx_common repository, so it was time to dig deeper and learn something new.
(If you already are familiar with rop chains you can skip this part.)
Why do we need to use a rop chain? On the Wii U no region in memory is executable and writeable at the same time (except for the JIT area, but we have no access to it in Mario Kart 8), so the idea is to use existing code. If you can control the stack, you can control the code flow. When calling a function, the position in the code of the “calling function” is saved on the stack. Whenever the called function returns, it jumps back to the address which was saved on the stack. By manipulating this return address it’s possible to jump anywhere in the code. Using clever places in the code it’s possible to chain several of these jumps to execute the needed instructions.
Functions which use the stack to store local variables have a common pattern. At the end of the function they load the saved return address from the stack and increase the stack pointer. By carefully crafting a stack we can jump to parts of code that are located directly before this pattern. Each address which you jump to is called a “gadget”.
Let’s imagine a stack where currently the stack pointer (r1) is pointing to address 0x20000000:
# Stack before running the gadget:
0x20000000: 0 <-- Stackpointer (r1)
0x20000004: 0x10000000 <-- current gadget address
0x20000008: 0 <-- Stackpointer (r1) + 0x08
0x2000000C: [NEW GADGETADDRESS] <-- Stackpointer (r1) + 0x0C
Now we assume that some function was just returning, setting the stack pointer to 0x20000000 and reading the address to jump to from 0x20000004. This means at this point the code flow continues at 0x10000000, with r1 = 0x20000000.
# Instructions of the gadget at 0x10000000
0x10000000: [SOME USEFUL INSTRUCTION 1]
0x10000004: [SOME USEFUL INSTRUCTION 2]
0x10000008: [SOME USEFUL INSTRUCTION 3]
0x1000000C: lwz r0, 0xc(r1); # load return address from stack pointer + 0x0c
0x10000010: mtlr r0; # move it to the link register (lr)
0x10000014: addi r1, r1, 8; # increase the stack pointer by 0x08
0x10000018: blr; # branch to link register
The first three instructions are the ones we are really interested in. Using these we want to achieve our planned behaviour. This could be for example loading values into registers (from the stack, which we can control!), moving values between registers, calling functions, writing values to memory and much more.
The instructions at 0x1000000C and 0x10000010 read the new return address from the stack pointer + 0xC, which is the value we’ve previously put on the stack (at 0x2000000C).
The instruction at 0x10000014 will increase the stack pointer by 0x08, afterwards the instruction at 0x10000018 will branch to the link register which was set by the previous instructions.
After executing the gadget the stack will look like this:
# Stack after running the gadget:
0x20000000: 0 <--
0x20000004: 0x10000000 <--
0x20000008: 0 <-- Stackpointer (r1)
0x2000000C: [NEW GADGETADDRESS] <-- current gadget address
[...] <-- stack data for the gadget in 0x2000000C
And a new gadget will be executed. This way chaining multiple gadgets is possible to achieve an intended behaviour.
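To make the walk through the stack a bit more tangible, here is a small Python sketch that follows the return addresses the same way the epilogue above does. It assumes every gadget ends with the pattern shown (load lr from r1 + frame size + 4, then add the frame size to r1). The second gadget address and the end marker are made up for the example.

```python
def run_chain(stack, sp, frame_sizes):
    """Follow the return addresses of a fake stack.

    stack: dict mapping an address to the 32-bit word stored there
    sp: initial stack pointer (r1)
    frame_sizes: dict mapping a gadget address to the number of bytes its
                 epilogue adds to r1
    Returns the visited (gadget, r1) pairs, the final pc and the final r1.
    """
    trace = []
    # The function that just returned loaded its return address from r1 + 4.
    pc = stack[sp + 4]
    while pc in frame_sizes:
        trace.append((pc, sp))
        frame = frame_sizes[pc]
        pc = stack[sp + frame + 4]  # lwz r0, frame+4(r1); mtlr r0
        sp += frame                 # addi r1, r1, frame
    return trace, pc, sp


stack = {
    0x20000000: 0,
    0x20000004: 0x10000000,  # first gadget, as in the example above
    0x20000008: 0,
    0x2000000C: 0x10000100,  # hypothetical second gadget
    0x20000010: 0,
    0x20000014: 0xDEADBEEF,  # placeholder marking the end of this toy chain
}
frame_sizes = {0x10000000: 8, 0x10000100: 8}

trace, pc, sp = run_chain(stack, 0x20000000, frame_sizes)
for gadget, r1 in trace:
    print(f"gadget 0x{gadget:08X} entered with r1 = 0x{r1:08X}")
```

The "useful" instructions of each gadget are not modelled here, only the control flow between gadgets.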
How to find rop gadgets
There are several tools that help you find rop gadgets. I had the best luck with the tool Ropper. Before you can use Ropper with Wii U binaries, you need to convert them to ELF files. Ropper allows you to display and filter all rop gadgets in a binary up to a specified length.
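To illustrate what such a tool does at its core, here is a heavily simplified Python sketch that scans a raw .text dump for blr instructions and lists the short instruction runs leading up to each of them. This is not how Ropper is implemented (it disassembles and filters properly); the function name and the sample bytes are just for illustration.

```python
import struct

BLR = 0x4E800020  # PowerPC encoding of "blr"


def find_gadget_candidates(text, base_address, max_instructions=4):
    """Yield (start_address, words) for every instruction run ending in blr.

    text: raw bytes of a .text section (big endian, 4 bytes per instruction)
    base_address: address at which the section is mapped
    """
    count = len(text) // 4
    words = struct.unpack(f">{count}I", text[: count * 4])
    for index, word in enumerate(words):
        if word != BLR:
            continue
        for length in range(1, max_instructions + 1):
            start = index - length + 1
            if start < 0:
                break
            # Stop at an earlier blr, the run would return before reaching ours.
            if length > 1 and words[start] == BLR:
                break
            yield base_address + start * 4, words[start : index + 1]


# nop; mtlr r0; addi r1, r1, 8; blr
sample = struct.pack(">4I", 0x60000000, 0x7C0803A6, 0x38210008, BLR)
candidates = list(find_gadget_candidates(sample, 0x10000000))
for address, words in candidates:
    print(f"0x{address:08X}: " + " ".join(f"{w:08X}" for w in words))
```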
Besides the actual binary of the exploited application you can also use rop gadgets from the system libraries (.rpl files). The “core” system libraries are always at the same location in memory, which makes them easily usable for rop gadgets. In fact it’s preferred to use gadgets from these libraries to be independent of the application to be exploited.
Here is a list of all system libraries that are at a fixed position in memory and their location (.text section, addresses in hex, FW 5.5.x+):
coreinit 101C400 - 1090F00
tve 1090F40 - 10B9BC0
nsysccr 10B9C00 - 10BFD40
nsysnet 10BFD80 - 10CFE60
uvc 10CFEC0 - 10D2120
tcl 10D2180 - 10ED6E0
dc 110D600 - 111FEC0
vpadbase 111FF00 - 1128840
vpad 1128880 - 113D5E0
avm 113D640 - 114EBE0
gx2 114EC40 - 11C3020
snd_core 11C3080 - 11E3820
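When collecting gadgets it is handy to quickly check whether an address belongs to one of these fixed libraries. A small Python lookup built from the list above could look like this. Treating the end address as exclusive is my assumption.

```python
# (name, start, end) of the .text sections listed above (FW 5.5.x+).
SYSTEM_LIBRARIES = [
    ("coreinit", 0x0101C400, 0x01090F00),
    ("tve", 0x01090F40, 0x010B9BC0),
    ("nsysccr", 0x010B9C00, 0x010BFD40),
    ("nsysnet", 0x010BFD80, 0x010CFE60),
    ("uvc", 0x010CFEC0, 0x010D2120),
    ("tcl", 0x010D2180, 0x010ED6E0),
    ("dc", 0x0110D600, 0x0111FEC0),
    ("vpadbase", 0x0111FF00, 0x01128840),
    ("vpad", 0x01128880, 0x0113D5E0),
    ("avm", 0x0113D640, 0x0114EBE0),
    ("gx2", 0x0114EC40, 0x011C3020),
    ("snd_core", 0x011C3080, 0x011E3820),
]


def library_for_address(address):
    """Return the name of the fixed system library containing address, or None."""
    for name, start, end in SYSTEM_LIBRARIES:
        if start <= address < end:  # end is assumed to be exclusive
            return name
    return None


print(library_for_address(0x0101C400))  # coreinit
print(library_for_address(0x02000000))  # None, not inside a fixed library
```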
It’s a good idea not to hardcode any of the addresses for rop gadgets, but instead to get them from the binaries, either via the ELF symbols or a hash. For improving the browser exploit I built a small Java tool that will return a list of gadgets for a config file. This way the rop gadgets for different versions of the binary can easily be found.
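As a rough illustration of "look it up instead of hardcoding it", the following Python sketch locates a gadget by searching for its instruction bytes in a .text dump. This is a simpler stand-in and not the Java tool mentioned above; the function name and sample data are made up.

```python
def locate_gadget(text, base_address, pattern):
    """Return the address of the first 4-byte aligned match of pattern, or None.

    text: raw bytes of a .text section
    base_address: address at which the section is mapped
    pattern: instruction bytes of the wanted gadget
    """
    offset = text.find(pattern)
    # PowerPC instructions are 4 bytes wide, so skip unaligned matches.
    while offset != -1 and offset % 4 != 0:
        offset = text.find(pattern, offset + 1)
    if offset == -1:
        return None
    return base_address + offset


# mtlr r0; addi r1, r1, 8; blr
epilogue = bytes.fromhex("7C0803A6" "38210008" "4E800020")
text = bytes.fromhex("60000000" "60000000") + epilogue
print(hex(locate_gadget(text, 0x10000000, epilogue)))  # 0x10000008
```

Running the same search against each version of a binary gives the matching address per version, so the rop chain generator never has to contain fixed numbers.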
Finding actual useful gadgets
After some research I finally knew enough to find a rop gadget on my own for the first time. The goal was to perform a stack pivot to be able to switch to a different (bigger!) stack. As we have learned in previous sections, the stack pointer is stored in register r1. To modify the stack pointer, we need to find a gadget that modifies r1.
To achieve this, I searched for any gadget that loads a value directly into r1, without any results. But I found a gadget that moves the content of r12 to r1, so I started searching for gadgets to control r12, without any success. But I found one that moves the content of r11 to r12… and so on. You see how this is going to end. The ultimate goal was to find a “chain” that starts by reading a value from the stack and moves it over several gadgets into r1. In the end I really managed to find a working set of gadgets to perform a stack pivot. It wasn’t the most gorgeous solution, but it worked. As the project moved on I was able to improve and shorten the chain multiple times.
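The search described above is basically a shortest path problem: registers are the nodes and "move register A to register B" gadgets are the edges. I did this by hand, but it can be sketched in Python as a breadth-first search. All gadget addresses and the register that can be loaded from the stack (r29 here) are invented for the example.

```python
from collections import deque


def find_register_chain(loaders, moves, target="r1"):
    """Find the shortest gadget sequence that gets a stack value into target.

    loaders: dict mapping a register to a gadget that loads it from the stack
    moves: dict mapping (source, destination) to a gadget that copies
           source into destination
    Returns a list of gadget addresses, or None if no chain exists.
    """
    queue = deque((register, [gadget]) for register, gadget in loaders.items())
    seen = set(loaders)
    while queue:
        register, chain = queue.popleft()
        if register == target:
            return chain
        for (source, destination), gadget in moves.items():
            if source == register and destination not in seen:
                seen.add(destination)
                queue.append((destination, chain + [gadget]))
    return None


# Hypothetical gadget table, the addresses are placeholders.
loaders = {"r29": 0x01010000}
moves = {
    ("r29", "r30"): 0x01010100,
    ("r29", "r11"): 0x01010200,
    ("r11", "r12"): 0x01010300,
    ("r12", "r1"): 0x01010400,
}
pivot = find_register_chain(loaders, moves)
print([hex(gadget) for gadget in pivot])
```

A real search also has to check that a gadget doesn’t destroy registers that are needed later, which this sketch ignores.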
Besides the rop size limitation, there was still the problem of being on the wrong CPU core. To switch the affinity of a thread, it needs to be suspended. This means it’s not possible for a thread to move itself to another CPU core. The obvious solution is to create another thread with the affinity to run on the target core. But there is one problem: The OSCreateThread function takes 9 arguments, but with the existing rop gadgets it’s only possible to call a function with up to 6 arguments.
Motivated by the success of finding a stack pivot gadget, I tried to find a rop gadget to create a thread. For quite some time I tried to find a gadget to call an arbitrary function with 9 arguments, but without success. Then I realized that OSCreateThread is just a wrapper for an internal “create thread” function, where the arguments for the call are taken from registers r25 to r31 instead of r3-r9. In the PowerPC architecture the arguments of a function are stored in registers r3 to r9 before the call, and finding code that sets these at the end of a function is much less likely than for the upper registers. The “upper” registers (e.g. r24 - r31) are often saved on the stack at the beginning of a function, and restored (loaded from the stack) at the end of a function. The combination of having an OSCreateThread gadget which loads its arguments from r25 to r31 and having an easy gadget to set these registers makes this function call with a huge amount of arguments feasible.
How to execute long rop chains
At this point it was possible to do a stack pivot and create another thread on the right core. But there was still the problem of the size limited rop chain. Rambo6Glaz and I tried to figure out a way to allow bigger rop chains and came up with two different ideas:
- Create a rop chain to load a bigger chain via the network
- Split up the “final” rop chain into multiple chunks, run the exploit multiple times and each time save one chunk inside an OSDriver.
While Rambo6Glaz focused on the network solution, I gave the OSDriver idea a shot.
Running the exploit multiple times!
The Wii U OS has a feature that allows libraries to install OSDrivers. Beside registering callbacks on certain events like acquiring or losing the foreground, OSDrivers can also store data inside the kernel. This is useful to store permanent data that can be used even after restarting or switching the application. Using the kernel syscalls directly lets us bypass some checks and simplifies the usage.
Here is a general workflow of this idea:
- Run the exploit in Mario Kart 8 to get rop chain execution.
- Build a rop chain that registers a new OSDriver and stores embedded data (in this case a part of a big rop chain) inside the kernel using CopyToSaveArea.
- Open the Home Menu via rop chain and exit the game.
- Go back to step 1 until the whole rop chain is placed in different OSDrivers.
- Build another rop chain that takes the data saved in the OSDrivers and executes it on a new thread on core 1 (GX2 main core in Mario Kart 8).
Using this approach I was able to store 816 bytes inside an OSDriver with each restart. I improved the rop chain generation to automatically take care of generating all the different rop chains that are needed.
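The splitting itself is simple. A minimal Python sketch of how a big chain could be cut into 816 byte pieces looks like this. The function names are made up, and it assumes one chunk per run plus one final run that collects and executes everything, as described in the workflow above.

```python
CHUNK_SIZE = 816  # bytes that fit into one OSDriver per run of the exploit


def split_chain(chain, chunk_size=CHUNK_SIZE):
    """Split a big rop chain into pieces that each fit into one OSDriver."""
    return [chain[i : i + chunk_size] for i in range(0, len(chain), chunk_size)]


def runs_needed(chain, chunk_size=CHUNK_SIZE):
    """One run of the exploit per chunk, plus the final run that executes it."""
    return len(split_chain(chain, chunk_size)) + 1


big_chain = bytes(1500)  # placeholder content, only the length matters here
chunks = split_chain(big_chain)
print(len(chunks), [len(chunk) for chunk in chunks], runs_needed(big_chain))
```

For any chain bigger than one chunk this gives at least three runs of the exploit.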
It worked quite well. Finally I could build a rop chain without thinking about the size limit. In fact the size of the final rop chain was limited by the amount of “read data from OSDriver X” gadgets, but I never reached it (~8000 bytes were possible). The downside: each try took quite long. I had to run the exploit at least three times to get the “final” rop chain running to check if it’s working. This leads to a test cycle of more than 5 minutes. For testing just some ideas it was enough, but in the long term it was really annoying.
Using this I was able to test some ideas that were previously not possible due to size constraints. One of the first things I tried was to shut down the GX2 engine and restart it again to have it in a clean state for the kernel exploit. This was now possible because we were on the right CPU core. But this resulted in a crash because the actual game was still running and using the GX2 engine. A simple solution was to suspend the main thread (which luckily is at a fixed address that can easily be obtained from the crash logs), and resume it at the end of the rop chain. Without resuming the main thread exiting the game wouldn’t be possible. But even with stopping the main thread and reinitializing the GX2 engine the exploit was still not working. Adding some waiting in various variations didn’t help either.
The best theory at the time was that it didn’t work because something in the background was still running and using the GX2 engine, interfering with the exploit. At this point I was really desperate and tried every single implementation of the kernel exploit in the rop chain, hoping one of them would actually work. But nothing was working.
From working on the plugin system I knew that threads on the CPU core 2 will actually keep running when opening the Home Menu. My idea was to perform the exploit while the game was suspended in the background, but this also didn’t work.
We need more gadgets!
Each application implements a ProcUI loop. ProcUI is a wrapper library which allows an easier usage of the system message queue from Cafe OS. The ProcUI loop is the place in the application where it’s decided if the application is requested to move to the background, just gained the foreground or should be closed. I thought by sending a “close application” to the game and keeping our own thread running we would have a chance of running the rop chain in a pretty clean environment without the actual game running and interfering with it.
The easiest way to tell a game that it should be closed is by calling the function SYSRelaunchTitle from the sysapp library, but actually using it was way harder than I thought. In this blog post we’ve already talked about the system libraries that are always at a fixed address in memory, but sysapp is not one of them. The function address can be easily obtained using OSDynLoad_Acquire and OSDynLoad_FindExport. The real problem is using any of the return values inside the rop chain, and calling a function through a function pointer instead of a fixed address.
To accomplish this once again more rop gadgets needed to be found. The function OSDynLoad_FindExport takes the module handle acquired via OSDynLoad_Acquire as its first argument, which dynamically changes after each restart. So the first needed gadget was a function call where the first argument is dereferenced from an address. In addition a gadget is needed to call the function pointer that is returned using the OSDynLoad_FindExport function.
After finding these gadgets it was finally possible to call SYSRelaunchTitle to trigger a game shutdown, but it turns out it also kills any other existing threads. The idea of keeping rop chain execution after shutting down the game didn’t work either.
But these new gadgets really helped to test new things. For example we were able to test the “magic” IM_SetDeviceState call which is used in the browser exploit to shut down the browser. It turns out that just emulating a press of the home button is not helping.
Loading bigger rop chains via the network!
The whole time I was using my slow “run the exploit multiple times to get a bigger rop chain”-approach, Rambo6Glaz was working on loading a second rop chain over the network.
At some point Rambo6Glaz finally managed to get stable execution of a rop chain sent via TCP to the console. The workflow was something like this:
- Create a new thread on CPU core 1
- Inside the thread connect to a TCP server and receive a bigger rop chain
- Do a stack pivot to execute the received rop chain
- Profit!
This was really stable and massively sped up the testing of new rop chains.
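On the PC side this only needs a tiny TCP server that hands out the second chain. The following Python sketch shows the general idea. The port number and the 4 byte length prefix are my own assumptions for the example and not necessarily the framing used by the real exploit.

```python
import socket
import struct


def send_chain(conn, chain):
    """Send the chain with a big endian 4 byte length prefix (assumed framing)."""
    conn.sendall(struct.pack(">I", len(chain)) + chain)


def recv_exact(conn, size):
    """Receive exactly size bytes or raise if the peer closes early."""
    data = b""
    while len(data) < size:
        part = conn.recv(size - len(data))
        if not part:
            raise ConnectionError("connection closed early")
        data += part
    return data


def recv_chain(conn):
    """Counterpart of send_chain, what the receiving side has to do."""
    (size,) = struct.unpack(">I", recv_exact(conn, 4))
    return recv_exact(conn, size)


def serve_chain_once(chain, host="0.0.0.0", port=12345):
    """Wait for a single connection and send it the chain."""
    with socket.create_server((host, port)) as server:
        conn, _address = server.accept()
        with conn:
            send_chain(conn, chain)
```

Calling serve_chain_once(chain) before triggering the exploit would then deliver the chain as soon as the console connects. On the console the receiving part has to be expressed as a rop chain calling the socket functions, which is the hard part and not shown here.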
Just keep GX2 running
Due to the faster testing I tried several new things. One of them was to stop trying to shut down and restart GX2 but still suspend the main thread of Mario Kart 8. This led to an exception in the kernel, so something was happening. To perform the kernel exploit we place a fake heap entry and modify the kernel heap to use it. The crash log suggested the kernel was indeed trying to read from the right address, but the data it read was not the data we had placed there. I wasn’t sure (and I am still not) if this was because of some weird caching issue, but I went the safe route and modified the exploit to read the fake heap entry from 0x2F200014 instead of 0x1F200014 and it worked first try.
I gave it a few more shots and it was indeed stable. Finally.
From now on we had a stable kernel exploit which granted us read/write access with kernel privileges. The JIT-area isn’t just helpful for providing easy userland code execution, but also provides easy kernel execution. It’s also the only region in memory which allows write and execute for the kernel, but we still had no access to this region.
Without kernel execution and with the default memory mapping there isn’t really anything special you can do with kernel privileged writes, only modifying the kernel .data section and registering a new syscall. Without being able to run custom code a new syscall isn’t that helpful. But kernel write is enough to change the tables inside the kernel which are used for the memory mapping and give us a mapping of an “execute only” region with write privileges. The downside of this is that we need to restart the application before the changes take effect. So we still need at least one restart.
Before restarting it’s important to revert the changes we did to the kernel heap. We also register a new syscall 0x25 which points to a memcpy function (0xfff09e44 on 5.5.x) to keep an easy way to perform copy operations with kernel privileges.
Userland code execution!
After performing the kernel exploit, setting up the memcpy syscall, mapping the memory and restarting Mario Kart 8 we perform the exploit once again. Now we can finally achieve code execution. Using the new memory mapping we can copy our executable into the free 0x011DD000...0x011E0000 region. Afterwards we override the “main()” function call with a jump to our code and switch to the Mii Maker. This will execute our payload in the Mii Maker context!
But we still have no real control of the kernel without kernel execution. Unfortunately the free 0x011DD000...0x011E0000 region which we are using for userland code execution has no kernel execution rights. I spent some time thinking about a solution when I remembered the RPX version of the homebrew launcher. The RPX version of the homebrew launcher was intended to run as a channel in an environment without kernel access, so it ships with its own kernel exploit. It also has no access to the JIT-areas, but somehow achieves kernel execution. Looking at the code reveals that there is a region in memory (0x017FF000, just before the JIT area) that is writable using the memory mapping and also has kernel execution rights. This is enough to have arbitrary kernel execution by placing a payload in this area and registering it as a syscall. By changing an IBAT (which controls the memory mapping) kernel execution rights can be provided for any other region in memory.
payload.elf loader
In previous blog posts I talked about a homebrew environment where all exploits should be able to load a payload.elf from the sd card and execute it. To achieve this we need to fulfill the requirements of the payload loader, and run the payload loader afterwards. One of the requirements is having a syscall which allows the modification of IBAT0 to gain kernel code execution. The other ones are just the “default” kern_read and kern_write syscalls. After installing these syscalls we just need to load the payload.elf loader into memory and run it.
Based on the JsTypeHax_payload I created a payload for the Mario Kart 8 exploit which sets up the needed syscalls for the payload loader and copies the loader into memory.
The 0x011DD000...0x011E0000 region is barely enough to fit this “payload loader installer” and the actual payload.elf loader, but it somehow fits.
After copying the payload.elf loader into memory it can finally be executed. An arbitrary payload.elf will be loaded from the sd card and executed. We are finally done.
Conclusion
In the end I spent way more time on this than I ever would have thought. So many times I was so close to just giving up, but somehow the exploit was really addicting. Once again a big shoutout to Rambo6Glaz (aka NexoCube) who worked on this at the same time. We shared our ideas and tried to motivate each other. In the end we both came up with a working solution, which is quite nice.
This blog post may not be the most technical one, and maybe not the most exciting one, but this is how developing such an exploit really is, at least in my experience. 95% of the time you’re just failing and trying different ideas. Several times you will be stuck, but somehow there is always a solution. On one side it feels like I’ve wasted way too much time on this, but on the other side I also learned so much. And it feels nice to actually finish such a demotivating project. Even if no one will ever actually use it.
Where can I find the code
I put all of the code on GitHub:
- https://github.com/wiiu-env/Mario-Kart-8-Exploit
- https://github.com/wiiu-env/Mario-Kart-8-Exploit_payload