
Dynamically Translating Binary Code for Multi-Threaded Programs Using Shared Code Cache


Chia-Lun Liu, Jiunn-Yeu Chen, Wuu Yang, and Wei-Chung Hsu


Abstract——mc2llvm is a process-level ARM-to-x86 binary translator developed in our lab over the past several years. Currently, it can only emulate single-threaded programs. We extend mc2llvm to emulate multi-threaded programs. Our main task is to restructure its architecture for multi-threaded programs: register mapping, code cache management, and address mapping in mc2llvm have all been modified. In addition, to further speed up the emulation, we collect hot paths and aggressively optimize and generate code for them at run time. Additional threads are used to hide this overhead. Thus, when the same hot path is walked through again, the corresponding optimized native code is executed instead. In our experiments, our system is 8.8X faster than QEMU (quick emulator) on average when emulating the specified benchmarks with 8 guest threads.

Index Terms——ARM, binary translation, hot path, mc2llvm, multi-threaded program, QEMU (quick emulator).

1. Introduction

Binary translation emulates one instruction-set architecture (ISA) on another ISA. Its applications include legacy program migration, program instrumentation [1], fast simulation [2], and security investigation [3]. mc2llvm [4] is a process-level ARM-to-x86 hybrid binary translation system developed in our lab in the past few years, which can perform both static and dynamic binary translation. Currently, it is only able to emulate sequential programs. Our main task is to extend mc2llvm so that it can emulate multi-threaded programs. In this paper, the emulated ISA is called the guest, and the emulating ISA is called the host.

A dynamic binary translator (DBT) takes the guest executable file as data. At run time, guest instructions are translated to host instructions whenever needed. The translated host instructions are stored in a memory area called the code cache. Thanks to the code cache, subsequent emulations of the same instructions can be done by directly executing the generated native code without repeated translation. In this research, we make the following contributions: 1) we restructure mc2llvm to emulate multi-threaded ARM binaries using a shared code cache; 2) we craft an efficient lookup method for the address mapping table; and 3) we add trace compilation to mc2llvm. The remainder of this paper is structured as follows. Section 2 sketches the binary translator mc2llvm. Section 3 illustrates the implementation details. Section 4 discusses the experimental results. Finally, this research is summarized in Section 5.

2. Background

mc2llvm [4] is an ARM-to-x86 process-level LLVM (low level virtual machine)-based [5] hybrid binary translation system. It can perform both static and dynamic binary translation, and both the guest and host operating systems (OS) should be Linux. Since our work focuses on dynamic binary translation, we describe only the dynamic binary translation of mc2llvm in this paper. To emulate a program, mc2llvm first maps the text and data segments of the guest executable into the host memory directly without adding any offset. mc2llvm starts translation from the entry point specified in the guest ELF (Executable and Linkable Format) file. The translation unit of mc2llvm is the basic block. Three steps are involved in the translation. Guest instructions in a guest basic block are translated into the LLVM intermediate representation (IR), and the LLVM optimizer then attempts to optimize the LLVM IR. Because the optimization is performed at run time, only light optimization is done at this stage in order to reduce run-time overhead. Finally, the LLVM JIT (just-in-time) compiler emits x86 instructions. The emitted instructions are kept in the code cache. Thus, subsequent emulation can be done by directly executing the generated native code in the code cache without repeated translation. Before translating at a guest address, mc2llvm looks up the address mapping table to check whether the corresponding translated host instructions already exist. If they do, the table-lookup routine returns the host address at which mc2llvm can execute code in the code cache. The next guest address is obtained when mc2llvm completes the emulation of a guest basic block by executing its translated code in the code cache. Some ARM instructions, such as the svc instruction, cannot be translated to LLVM IR directly. This instruction triggers a system call. mc2llvm translates it into a call to a wrapper function that handles the corresponding guest system call.
The execution flow is presented in Fig. 1.

Fig. 1. Program flow of mc2llvm: g represents a guest address and h is the host address.
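The lookup-translate-execute cycle described above can be illustrated with a toy dispatch loop. The following is a minimal sketch, not mc2llvm's actual code: a "guest program" of three blocks is modeled as C functions, each returning the next guest address (3 means exit), and translation merely installs a function pointer in the cache. All names are hypothetical; the point is that each block is translated exactly once and reused from the code cache afterwards.

```c
#include <stdint.h>

/* A translated block in the code cache: runs and returns the next guest pc. */
typedef uintptr_t (*HostBlock)(void);

static int executed[3];     /* how many times each block ran      */
static int translations;    /* how many blocks were "translated"  */

static uintptr_t block0(void) { executed[0]++; return 1; }
static uintptr_t block1(void) { executed[1]++; return 2; }
static uintptr_t block2(void) { executed[2]++; return 3; }  /* 3 = exit */

static HostBlock code_cache[3];   /* guest pc -> cached host code */

/* "Translate" a block on a cache miss and remember the result. */
static HostBlock translate_block(uintptr_t pc) {
    static const HostBlock blocks[3] = { block0, block1, block2 };
    translations++;
    return code_cache[pc] = blocks[pc];
}

/* Dispatch loop: look up the guest pc, translate on a miss, execute. */
static void dispatch_loop(uintptr_t pc) {
    while (pc < 3) {
        HostBlock code = code_cache[pc] ? code_cache[pc]
                                        : translate_block(pc);
        pc = code();          /* run the block, get the next guest pc */
    }
}
```

Running the "program" twice exercises both paths: the first pass translates every block, the second executes entirely out of the cache.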

3. Design and Implementation

Our modified architecture of mc2llvm for multi-threaded programs is presented in Fig. 2. Two kinds of threads are involved in the emulation. The first kind is the emulating threads, each of which emulates the behavior of one guest thread. Each emulating thread has a private data structure that keeps the guest architecture state for its guest thread. The guest architecture state mainly includes the ARM guest registers, the base address of the thread-local storage, and the condition flags.

Fig. 2. Architecture of the modified mc2llvm.

The second kind is the optimizing threads. An optimizing thread periodically tries to take a trace from a FIFO (first-in first-out) queue of hot paths. If one is available, the optimizing thread uses the trace generator to generate the host instructions for the trace and then stores them in the shared trace code cache. The number of emulating threads is determined by the program being emulated; the number of optimizing threads is set to 3 in our experiments.

3.1 Thread Creation and Termination

A new child thread is created by the clone system call. Its stack and thread-local storage are prepared by its parent thread. When a guest clone system call is encountered, we first call the function pthread_create to produce a new emulating thread. This thread is responsible for emulating the new guest thread. mc2llvm copies the guest architecture state of the parent emulating thread to the child emulating thread, sets the ARM guest register r0 to 0 and r13 to the new stack address, and sets the guest thread-local storage base to a value given by the guest clone system call. The guest register r0 of the parent thread is set to the child's thread id. A parameter of the clone system call indicates which part of the parent's data structure should be copied to the child thread.
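The clone emulation described above can be sketched as follows. This is a simplified illustration, not mc2llvm's actual code: the GuestState and EmuThread structures and all function names are hypothetical, and the real emulating thread would enter the dispatch loop instead of returning immediately.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-thread guest architecture state (names illustrative). */
typedef struct {
    uint32_t r[16];   /* ARM guest registers r0-r15                 */
    uint32_t tls;     /* base address of guest thread-local storage */
    uint32_t cpsr;    /* condition flags                            */
} GuestState;

typedef struct {
    GuestState state;
    pthread_t  host_thread;
} EmuThread;

/* Entry point of the new emulating thread (stub; the real system
 * would enter the dispatch loop here). */
static void *emulate_guest_thread(void *arg) {
    (void)arg;
    return NULL;
}

/* Emulate a guest clone(): spawn a host pthread, copy the parent's
 * guest state, then patch r0, r13, and tls as the guest ABI expects. */
static int emulate_clone(EmuThread *parent, EmuThread *child,
                         uint32_t new_sp, uint32_t new_tls,
                         uint32_t child_tid) {
    child->state = parent->state;    /* copy parent's guest state       */
    child->state.r[0] = 0;           /* child sees return value 0       */
    child->state.r[13] = new_sp;     /* new guest stack pointer         */
    child->state.tls = new_tls;      /* TLS base from the clone() args  */
    parent->state.r[0] = child_tid;  /* parent sees the child's tid     */
    return pthread_create(&child->host_thread, NULL,
                          emulate_guest_thread, child);
}
```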

A thread is terminated by the exit or exit_group system call. When an emulating thread encounters the exit system call, it does some cleanup and then calls pthread_exit. Before it terminates itself, however, the emulating thread has to test whether child_tidptr, which is defined below, holds a non-zero value. If so, it performs a futex wakeup at that address. To emulate exit_group, mc2llvm simply terminates the program.

child_tidptr is the name of a parameter of the clone system call. When a guest clone system call is invoked, the new emulating thread keeps this value if necessary, as determined by testing a flag, which is also a parameter of the clone system call.

The emulation of the creation and termination of a thread is done in the wrapper function that emulates the guest system calls.

3.2 Atomic Operations

In our experiments, we observe two ways to perform atomic operations. One is to call the kernel-supported __kuser_cmpxchg function, which resides at a fixed address. The other is to invoke the futex system call, which is the building block for mutexes.

__kuser_cmpxchg is a function that performs a compare-and-swap atomically and sets the condition flag C to 1 if the operation succeeds and to 0 otherwise. We emulate __kuser_cmpxchg by performing the compare-and-swap guarded by a lock/unlock pair. The guest futex system call is emulated by calling the host futex system call. Note that both the host and the guest run Linux in our experiments.
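A minimal sketch of this emulation, with hypothetical names, might look as follows: the compare-and-swap itself is ordinary code, and a host mutex makes the read-compare-write sequence atomic. The return value stands in for the guest C flag.

```c
#include <pthread.h>
#include <stdint.h>

/* Guard for the emulated compare-and-swap (illustrative). */
static pthread_mutex_t cmpxchg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulate ARM's __kuser_cmpxchg: returns the new value of the guest
 * C flag, i.e. 1 if the swap succeeded and 0 otherwise. */
static int emulate_kuser_cmpxchg(uint32_t oldval, uint32_t newval,
                                 uint32_t *ptr) {
    int success = 0;
    pthread_mutex_lock(&cmpxchg_lock);   /* make read-compare-write atomic */
    if (*ptr == oldval) {
        *ptr = newval;
        success = 1;
    }
    pthread_mutex_unlock(&cmpxchg_lock);
    return success;
}
```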

3.3 Thread-Local Storage

In our environment, the base address of the guest thread-local storage is read through the kernel-supported function __kuser_get_tls and is set by the ARM-private system call set_tls. Each emulating thread has a member tls in its guest architecture state to hold this base address. set_tls sets tls to a given value, while __kuser_get_tls simply returns tls. Since the thread-local storage of every thread, including the main thread, is allocated at run time, mc2llvm does not need to pre-allocate this space for each guest thread.

3.4 Shared Code Cache

Because the operating system performs the context switch of the architecture states, multiple threads on the same CPU can execute the same code without interfering with one another. Under emulation, however, the guest architecture state is maintained by the respective emulating thread, so the OS cannot context-switch the guest architecture states automatically. It becomes the translator's duty to make sure that emulating threads work with the correct guest architecture states. In mc2llvm, to load an ARM guest register such as r1, we first load the guest architecture state of the current emulating thread and then load the member in that state that corresponds to r1. In this way, when two emulating threads operate on the same guest register, each loads its own private copy. Therefore, the code cache can be shared, because the registers used in an instruction refer to the private registers of the executing thread. The original mc2llvm, which is intended for single-threaded programs, maps each guest register to an individual LLVM global variable. The emulation of single-threaded programs works correctly under this register mapping because only a single emulating thread manipulates the LLVM global variables. The alternative to the shared code cache is the private code cache, in which each emulating thread keeps its own code cache. More memory would be needed if private code caches were used.
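The register-mapping idea can be sketched in C, with all names hypothetical. A thread-local pointer stands in for however the real translator locates the current emulating thread's state; the key property is that the same host code indexes into whichever thread's state is current, so one shared copy of the translated code serves every thread.

```c
#include <stdint.h>

/* Hypothetical guest architecture state (illustrative). */
typedef struct {
    uint32_t r[16];   /* ARM guest registers r0-r15 */
} GuestState;

/* Each emulating thread points at its own private state. */
static _Thread_local GuestState *current_state;

/* Shared-cache-friendly register access: the same code works for
 * every emulating thread because it goes through current_state. */
static uint32_t load_guest_reg(int n)              { return current_state->r[n]; }
static void     store_guest_reg(int n, uint32_t v) { current_state->r[n] = v; }

/* Example translation of the guest instruction "add r2, r0, r1". */
static void emit_add_r2_r0_r1(void) {
    store_guest_reg(2, load_guest_reg(0) + load_guest_reg(1));
}
```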

3.5 Address Mapping Table

In Fig. 2, there are two kinds of address mapping tables: global-shared and thread-private address mapping tables. The global-shared table contains all the address-mapping entries and the thread-private table contains only the entries recently used by the owner thread. Both tables are hash tables.

Fig. 3 presents how to access the address mapping tables. For a read access, the emulating thread looks up its own thread-private address-mapping table first. If not found there, the global-shared table will be searched next.

Fig. 3. Write and read accesses to the address mapping tables. g is a guest address, h is a host address, and (g, h) is the mapping between them.

For a write access, the emulating thread stores the address-mapping entry in both the global-shared and the thread-private tables. The thread-private tables speed up the search significantly because access to a private table is free of synchronization, while the global table has to be locked before access.

In our experiments, table lookup is much faster than synchronization. Although the thread-private tables store only recently used address-mapping entries, the hit rate of read accesses in our experiments is very high (98%). As a result, the global-shared table is consulted only infrequently. The thread-private table is a simple array with no more than 10000 entries, because we find that the programs in the benchmark need only 1000 to 7000 address-mapping entries.
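The two-level lookup can be sketched as follows. This is a simplification under stated assumptions: the private table is a direct-mapped array as in the paper, but the global table here is a same-sized array under a mutex, whereas the real system uses a lock-free hash table [6]; all names and the hash are illustrative.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define PRIV_SIZE 8192   /* illustrative; the paper bounds it by 10000 */

typedef struct { uintptr_t guest, host; } Entry;

/* Thread-private table: lock-free, direct-mapped. */
static _Thread_local Entry priv_table[PRIV_SIZE];

/* Global-shared table, mutex-protected here for simplicity. */
static Entry global_table[PRIV_SIZE];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned slot(uintptr_t g) { return (unsigned)((g >> 2) % PRIV_SIZE); }

/* Write access: store the mapping in both tables. */
static void map_insert(uintptr_t g, uintptr_t h) {
    unsigned i = slot(g);
    priv_table[i] = (Entry){ g, h };
    pthread_mutex_lock(&global_lock);
    global_table[i] = (Entry){ g, h };
    pthread_mutex_unlock(&global_lock);
}

/* Read access: try the private table first; fall back to the global. */
static uintptr_t map_lookup(uintptr_t g) {
    unsigned i = slot(g);
    if (priv_table[i].guest == g)
        return priv_table[i].host;        /* fast path: no lock taken */
    pthread_mutex_lock(&global_lock);
    uintptr_t h = (global_table[i].guest == g) ? global_table[i].host : 0;
    pthread_mutex_unlock(&global_lock);
    if (h)
        priv_table[i] = (Entry){ g, h };  /* cache for the next lookup */
    return h;
}
```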

3.6 Synchronization

In our translation system, there are five major critical sections:

· The instruction translator

· The global-shared address-mapping table

· The concurrent FIFO

· LLVM functions used in the system

· The emulation of atomic operations

The instruction translator is invoked when an emulating thread fails to find an address mapping in the global-shared address-mapping table. More than one emulating thread may fail to find the same address mapping and then call the instruction translator at almost the same time. Therefore, the instruction translator is guarded by a global lock.

The instruction translator is not used very frequently because instructions are translated only once. Consequently, the global lock does not hurt performance much. The global-shared address-mapping table is crafted based on [6]. It is a hash table composed of an array of lists. Atomic CAS (compare-and-swap) instructions are used to lock the table before elements are inserted or deleted. If the CAS fails, the whole operation is retried until it succeeds.
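The retry-until-success CAS discipline can be sketched as a tiny spinlock in C11 atomics. This is an illustrative stand-in, not the lock-free structure of [6]: it only shows the pattern of looping on a failed compare-and-swap.

```c
#include <stdatomic.h>

/* 0 = free, 1 = held (illustrative). */
static atomic_int table_lock;

/* Acquire: CAS 0 -> 1; on failure, reset the expected value and retry. */
static void lock_table(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&table_lock, &expected, 1))
        expected = 0;   /* CAS wrote the observed value into expected */
}

/* Release: a plain atomic store back to 0. */
static void unlock_table(void) {
    atomic_store(&table_lock, 0);
}
```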

The concurrent FIFO is a bridge that passes traces selected by the emulating threads to the optimizing threads. A well-known method [7] is employed; the technique is similar to [6]. As for the LLVM functions, LLVM has an internal mechanism to prevent race conditions. Finally, the atomic operations are emulated with a lock/unlock pair.

3.7 Trace Compilation

The emulating threads perform profiling and select traces. The selected traces are kept in the concurrent FIFO. The optimizing threads compile and optimize the traces in the FIFO one by one. At the beginning of each translated block, we patch the code to perform trace selection, which is presented in Fig. 4. The trace is optimized at LLVM's aggressive optimization level. We insert several instructions in each translated block to test whether it is the head of a trace. If so, the emulating thread jumps to the shared trace code cache to execute the instructions of the trace. Otherwise, the emulating thread executes the host instructions in the shared block code cache.

Fig. 4. Trace selection. The emulating thread starts at TRACE HEAD DETECTION. TB stands for translation block. L is a linked list.
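The per-block check patched into each translated block can be sketched as follows. This is a hypothetical simplification: the structure, the hotness threshold, and the callback are all illustrative, and the real system hands hot traces to the concurrent FIFO rather than to a function pointer.

```c
#include <stdint.h>
#include <stddef.h>

#define HOT_THRESHOLD 50   /* illustrative trip count, not from the paper */

typedef struct {
    uint32_t guest_pc;     /* guest address of the block              */
    uint32_t counter;      /* times this block started execution      */
    void    *trace_code;   /* optimized trace code, NULL until ready  */
} TransBlock;

/* Sketch of the check at the head of a translated block: count
 * executions, enqueue the block for optimization once it turns hot,
 * and dispatch to the optimized trace as soon as one exists. */
static void *dispatch_block(TransBlock *tb, void (*enqueue)(TransBlock *)) {
    if (tb->trace_code)
        return tb->trace_code;    /* jump to the shared trace code cache */
    if (++tb->counter == HOT_THRESHOLD)
        enqueue(tb);              /* hand the trace to an optimizer      */
    return NULL;                  /* keep running the block-cache code   */
}

/* Example consumer callback: remember which block was queued. */
static TransBlock *last_enqueued;
static void remember(TransBlock *tb) { last_enqueued = tb; }
```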

4. Experimental Results

ARMv5te statically linked binaries are fed into the binary translator. We run the translator on a 48-core x86_64 host machine with the Debian Linux operating system. Each core runs at 2.1 GHz and the memory is 48 GB. The benchmark suite we use is SPEC OMP2001. All the benchmarks are compiled with GNU gcc or gfortran 4.3.2 with the flags -O2 -static -fopenmp. Among the 11 benchmarks in SPEC OMP2001, our translator fails to compile galgel and fails to emulate ammp. To build mc2llvm, the LLVM-3.2 library must be installed beforehand. Four experiments are conducted and described as follows.

4.1 Parallel Emulation

We experiment with various numbers of threads for each benchmark. The resulting emulation time is shown in Fig. 5. The experiments show that the emulation time decreases when more threads are used, except for gafort.

Fig. 5. Ratio of the emulation time for mc2llvm that use various numbers of guest threads.

4.2 Comparison of QEMU and mc2llvm

Fig. 6 shows the emulation time of QEMU and mc2llvm. Each benchmark uses 8 guest threads. QEMU is slower than mc2llvm because, in QEMU, access to the common address-mapping table is protected by a lock and the table is accessed very frequently, whereas mc2llvm provides a thread-private address-mapping table. On average, mc2llvm is 8.8x faster than QEMU.

Fig. 6. Ratio of the emulation time for QEMU and mc2llvm.

4.3 Comparison with an ARM Native Machine (Origen)

Origen is an ARM board equipped with a quad-core Samsung Exynos 4 (Cortex-A9) at 1.4 GHz and 1 GB of memory. Fig. 7 shows the emulation time of Origen and mc2llvm. Each benchmark uses 4 guest threads. On average, Origen is 5.45x faster than mc2llvm.

Fig. 7. Ratio of the emulation time for Origen and mc2llvm.

4.4 Trace Compilation

Fig. 8 shows the emulation time of mc2llvm with and without trace compilation. Each benchmark uses 8 guest threads in this experiment. We take the SPEC OMP2001 benchmarks with the ref data set as input because trace compilation normally needs a longer run to show its advantage. On average, we obtain a 33% speedup with trace compilation.

Fig. 8. Ratio of the emulation time for mc2llvm with and without trace compilation.

5. Conclusions

We extend the binary translator mc2llvm [4] from sequential-program emulation to multi-threaded-program emulation. The code cache is shared among the threads in order to reduce memory usage. The synchronization overhead is largely removed by keeping thread-private address-mapping tables in addition to the global-shared address-mapping table. Run-time optimization is done by trace compilation, with 3 extra threads generating code for the traces. According to our experiments, mc2llvm is 8.8x faster than QEMU when emulating programs with 8 guest threads. mc2llvm is 5.45x slower than the Origen board [8] when emulating programs with 4 guest threads. With trace compilation, mc2llvm gains a 33% speedup on average.

[1] N. Nethercote and J. Seward, “Valgrind: A framework for heavyweight dynamic binary instrumentation,” in Proc. of 2007 ACM SIGPLAN PLDI, 2007, pp. 89-100.

[2] E. Witchel and M. Rosenblum, “Embra: Fast and flexible machine simulation,” in Proc. of 1996 ACM SIGMETRICS Int. Conf. on Measurement and Modeling of Computer Systems, 1996, pp. 68-79.

[3] M. Payer and T. R. Gross, “Fine-grained user-space security through virtualization,” in Proc. of the 7th ACM SIGPLAN/SIGOPS Int. Conf. on Virtual Execution Environments, 2011, pp. 157-168.

[4] B.-Y. Shen, J.-Y. You, W. Yang, and W.-C. Hsu, “An LLVM-based hybrid binary translation system,” presented at the 7th IEEE Int. Symposium on Industrial Embedded Systems (SIES'12), Karlsruhe, Germany, 2012.

[5] V. Adve and C. Lattner. (March 2013). The LLVM Compiler Infrastructure. [Online]. Available: http://llvm.org/

[6] M. M. Michael, “High performance dynamic lock-free hash tables and list-based sets,” in Proc. of the 14th Annual ACM Symposium on Parallel Algorithms and Architectures, 2002, pp. 73-82.

[7] M. M. Michael and M. L. Scott, “Simple, fast, and practical non-blocking and blocking concurrent queue algorithms,” in Proc. of the 15th Annual ACM Symposium on Principles of Distributed Computing, 1996, pp. 267-275.

[8] Origen: an ARM board. (March 2013). [Online]. Available: http://www.origenboard.org

Chia-Lun Liu was born in Tainan in 1989. He received the B.S. degree from National Sun Yat-sen University, Kaohsiung in 2011 and the M.S. degree from National Chiao Tung University, Hsinchu in 2013, both in computer science. He is currently working as a software engineer at Gemtek. His research interests include operating systems and compilers.

Jiunn-Yeu Chen received both the B.S. degree and M.S. degree from the National Chiao Tung University, Hsinchu in 2006 and 2008, respectively. He is currently pursuing the Ph.D. degree with the Department of Computer Science in the National Chiao Tung University. His research interests include binary translation, system software, and parallel programming.

Wuu Yang received the B.S. degree from the National Taiwan University, Taipei in 1982 and the Ph.D. degree from the University of Wisconsin, Madison in 1989, both in computer science. He is currently a professor with the Department of Computer Science, National Chiao Tung University, Hsinchu. His research interests include embedded systems, systems software, programming language, and garbage collection.

Wei-Chung Hsu received his Ph.D. degree in computer science from the University of Wisconsin, Madison in 1987. Currently, he is a professor with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei. From 1999 to 2009, he was a professor with the Computer Science and Engineering Department, University of Minnesota, Twin Cities. From 1993 to 1999, he was a runtime optimization architect with the California Language Lab, Hewlett Packard Company. Prior to joining HP, he was a senior architect with Cray Research in Chippewa Falls, Wisconsin. His current research interests include virtualization, dynamic binary translation, and optimization techniques.

Manuscript received December 16, 2013; revised March 14, 2014. This work was supported by NSC under Grant No. NSC 100-2218-E-009-009-MY3 and NSC 100-2218-E-009-010- MY3.

C.-L. Liu is with the Department of Computer Science, National Chiao-Tung University, Hsinchu (Corresponding author e-mail: bendog0224@gmail.com).

J.-Y. Chen, W. Yang, and W.-C. Hsu are with the Department of Computer Science, National Chiao-Tung University, Hsinchu (e-mail: morningrain@gmail.com, wuuyang@cs.nctu.edu.tw, hsu@cs.nctu.edu.tw).

Digital Object Identifier: 10.3969/j.issn.1674-862X.2014.04.018
