No. 40 (299), issue 14. Pages 141-155

Thread “Efficiency” on the Shared Memory Multiprocessors

M.O. Bakhterev
It is traditionally assumed that a computation decomposed in a given way into several threads runs more efficiently on shared-memory systems (SMP or NUMA) than the same computation decomposed into several processes. In this work we hypothesize that this assumption may be false for computations over large data volumes, mainly for two reasons. First, maintaining a common shared address space for the threads may introduce substantially more overhead than the aggregate cost of switching execution contexts between processes. Second, even when the computation does not require intensive memory management, the TLB can hold translations for only a limited part of the memory working set, so this translation cache has to be refilled frequently when threads are used as well. Experiments and their results, which confirm the hypothesis, are described later in the article.
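The second reason can be made concrete with a rough "TLB reach" estimate: the amount of memory whose translations fit in the TLB at once is the number of entries multiplied by the page size. The sketch below is only an illustration under stated assumptions. The entry count and working-set size are hypothetical values, not figures from the article, and the helper names `tlb_reach_bytes` and `covered_fraction` are introduced here for the example.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory that a TLB with `entries` slots can map at one time."""
    return entries * page_size


def covered_fraction(working_set: int, entries: int, page_size: int) -> float:
    """Share of the working set whose translations fit in the TLB at once."""
    return min(1.0, tlb_reach_bytes(entries, page_size) / working_set)


# Hypothetical figures, chosen only for illustration (not from the article).
ENTRIES = 1024
WORKING_SET = 4 * GIB

for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
    reach = tlb_reach_bytes(ENTRIES, page)
    share = covered_fraction(WORKING_SET, ENTRIES, page)
    print(f"{label} pages: reach {reach // MIB} MiB, "
          f"covers {share:.2%} of the working set")
```

Under these assumed numbers, 4 KiB pages give a reach of 4 MiB, a small fraction of a 4 GiB working set. Sharing one address space between threads does not change that ratio, which is the intuition behind expecting frequent TLB refills in the threaded case too.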
Keywords: shared memory, performance, threads, processes.