<p>In parallel with the development of Fugaku, Japan's new flagship supercomputer, we have continued to improve the nonhydrostatic icosahedral atmospheric model (NICAM). Here, we present the results of the system-application co-design that we have carried out since 2014. Fugaku's CPU (A64FX) is based on the Arm instruction-set architecture. This 48-core many-core CPU is equipped with 32 GB of HBM2 memory and delivers memory bandwidth comparable to that of GPUs. We implemented kernel-level optimizations to take advantage of Fugaku's high memory performance. In doing so, we identified trade-offs among memory locality, parallelism, and register allocation. We improved the application's average arithmetic intensity through detailed loop-by-loop performance measurements, and we reduced memory pressure by actively using single-precision operations. We also redesigned the data layout and the file I/O component of the ensemble data assimilation (DA) system, achieving good scalability in both the atmospheric simulation and the DA. We performed a 1024-member ensemble simulation and DA on a global 3.5 km mesh using 82% of the Fugaku system (131,072 nodes, 6,291,456 cores). In this benchmark experiment, the world's largest ensemble DA, the simulation and the DA achieved effective performance of 29 PFLOPS and 79 PFLOPS, respectively.</p>