Interview with Anthony Williams about parallelization

C++ World Café at the C++ UG Karlsruhe on June 14th 2017

Anthony: Anthony Williams (via Skype)

Timo: Timo Bingmann (table host)

Other: other participants


Timo: Hi Anthony. I am Timo. Nice talking to you.

Anthony: Hi.

Other: I am Tobias.

Other: And I am Markus.

Timo: We are joining you here from Karlsruhe. I had the pleasure of looking up your book this morning. We are probably going to get it because it contains a lot more information than I had expected. And maybe we are going to have to give it to our students, so they might have to learn it. So: We managed in the last 1.5 hours to fill a table (can you see this?) with lots of nice ideas.

Anthony: Nice.

Timo: The topic was: "Which support and advantages does C++ have for parallelization?" I photographed the table for you; can you see the pictures of our table?

Anthony: Yes, I can see writing on the table.

Timo: So, we found out that there are a lot of different frameworks that are available in C++. So there is the C++11 threading complex with std::atomic and so on. There is OpenMP, which we managed to distill down into two lines, because that's what everybody needs; and that's all you need to know. And interestingly, one of our people here was using Qt's threads, QThread, because everything is written in Qt in their application; so then they have to use QThread, and these are apparently very good. I mean they're pretty old as well, right?
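
[For reference, the "two lines" of OpenMP mentioned above look roughly like this; a minimal sketch, assuming a compiler invoked with -fopenmp, with made-up array and variable names:]

    #include <vector>

    int main() {
        std::vector<double> data(1'000'000, 1.0);
        // The one pragma that covers most everyday use:
        // split the loop iterations across the available threads.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(data.size()); ++i)
            data[i] *= 2.0;
    }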

Anthony: Yea.

Timo: And then we ventured into a different area, namely SIMD parallelization.

Anthony: Oh -- yes.

Timo: Which we didn't actually think about immediately, because we were like: "parallelization is multi-threaded stuff, not vectorization". But apparently, OpenMP also has SIMD support, and that seems to be more important these days than one would expect.

Anthony: Yes. Compilers do that, they vectorize constructs directly -- to some extent, anyway.

Timo: To some extent. But the thing is: you have to have some kind of guarantee (right?) that they will actually produce SIMD code. Yes, they do auto vectorization things.
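
[A small sketch of the OpenMP SIMD directive being discussed; the function and array names are made up, and the pragma is a request rather than a hard guarantee that the compiler emits vector instructions:]

    // Ask the compiler to vectorize this loop. Combining threading and SIMD
    // is also possible: #pragma omp parallel for simd
    void scale(float* x, const float* y, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            x[i] = 2.0f * y[i];
    }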

(At 3:10)

Timo: A lot of these things are random thoughts, of course. Let's see: Then people deviated into multi-process parallelization, interestingly. And then the whole complex of interprocess communication comes up, where MPI is basically the standard solution. MPI is good but not perfect. It was good 20 years ago; nobody here knew whether it is still good.

Anthony: I think people still use it, and it solves their problem. Particularly, people tend to use that more on cluster computers.

Other: I think it is mostly an HPC framework.

Timo: Yes. Somebody added that she is using it because of the many libraries that are available in MPI (sorry for the German notes, "Bibliothek" means library).

Anthony: That's ok.

Timo: So, due to the many libraries, particularly numeric computation libraries, that are in Open MPI, she uses Open MPI on a single computer. That is also parallelization of C++ code, right?

Anthony: Yes.

Other: And sometimes Fortran.

Timo: Right.
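
[Since MPI came up several times, a minimal MPI "hello world" for orientation; assuming an MPI implementation such as Open MPI is installed, compiled with mpicxx and started with mpirun -np 4:]

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // id of this process
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
        std::cout << "hello from rank " << rank << " of " << size << "\n";
        MPI_Finalize();
    }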

(At 4:40)

Timo: Back to the first question: "Which support and advantages does C++ have for parallelization?". So, the support part was the easy part. The advantages were the hard part. Now I am interested: what are the advantages that you would name?

Anthony: The question is: advantages over what? What are the alternatives?

Other: I think it's two things: It's either the advantage over using some library that isn't portable, like pthreads. Or compared to the standard library support which we have in C++ or in other languages for threads.

Anthony: Yes, comparing the C++ standard library to the standard library from other languages.

Timo: And the available parallelization libraries/things/features.

Anthony: I guess that also depends on what you are trying to achieve. One of the exciting things coming with C++17, which I know that Microsoft is actually working on (and there is also the HPX implementation; see http://stellar-group.org/2015/05/hpx-and-the-cpp-standard/ and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4411.pdf), is the parallelization of the STL.

Timo: Oh right, we completely missed out on that. Yes, there is a parallel STL. So, actually, there is an older one called the MCSTL, and the GNU parallel version, which was initiated at our institute here. And Microsoft copied the idea ten years later and will hopefully make it correct this time -- production code and not academic code. Yes, that is definitely also important.
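
[What the C++17 parallel STL looks like, as a minimal sketch; std::execution::par is the standard spelling, though at the time of the interview only preview implementations such as Microsoft's and HPX's shipped it:]

    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(10'000'000, 1.0);
        // The execution policy is the only change against the serial call.
        std::for_each(std::execution::par, v.begin(), v.end(),
                      [](double& x) { x *= 2.0; });
        double sum = std::reduce(std::execution::par, v.begin(), v.end());
        (void)sum;
    }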

(At 6:55)

Timo: So other advantages are the usual things: zero-cost abstractions and full control over what your code does. What I added is the expressive power, but also the execution safety that you have in C++: that your C++ code executes these statements in this order, right? So, the safety to work in parallel, i.e. that your code is optimized, but not so aggressively that the optimizer interferes with the parallelization.

Anthony: Yes, you should use the synchronization tools that we've got. And that gives you quite a lot of scope to make sure that everything is suitably serialized.

Timo: ...that it doesn't change too much in your code. Maybe that's a different way to express it.

Anthony: Yes.

Other: But I think the semantics of these synchronization tools are also sometimes a bit weird; or for somebody who hasn't done a lot of work in parallelization, getting into this whole tool set isn't that easy -- it wasn't that easy for me.

Timo: And one of the other big advantages that someone then came up with is thread sanitizer (TSan), interestingly enough. Even though it is a tool and not a language feature, I don't believe any other language has a comparably good tool for machine-close parallelization; for checking that you write correct parallel code so close to the machine.

Anthony: Yes. I am just trying to have a think -- I can't think of any.

Timo: So thread sanitizer is amazing, right? And without thread sanitizer, you basically have to think about each statement and whether it has some kind of race condition or indeterminism inside it. Even though it is a tool and not part of the language, it's still really a game changer for parallelization. That's what we wrote here. Interesting, right?

Anthony: Yes.

Timo: It is the first tool that actually enables you to write large parallel programs that work correctly, that do not have race conditions. Right?

Anthony: I think people would contest the idea of it being the first tool for that... to some extent others have done that for 20 years.

Timo: Yes. There are lots of simulators, rather than instrumentation tools, that would do that as well. But you can't check a huge program like Chrome with those.

Anthony: No. Thread sanitizer is very good and available to use.
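
[A tiny example of the kind of data race that thread sanitizer catches; hypothetical code, built with clang++ or g++ using -fsanitize=thread -g and run normally:]

    #include <thread>

    int counter = 0;                       // shared, unsynchronized

    int main() {
        std::thread t([] { ++counter; });  // racy write from the new thread
        ++counter;                         // racy write from the main thread
        t.join();                          // TSan reports the race at runtime
    }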

Timo: And the last advantage that we wrote down is the portable support of the memory model that is ingrained in the atomics. The portability is the important part.

Anthony: Yes. That is one of the key things, in my point of view, that we got out of the C++11 standard: the concrete memory model.

Timo: And that it is also portable to exotic processors -- how well it is portable there is, of course, the question.

Anthony: Yes.
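
[A small sketch of the portable memory model in action: the classic release/acquire message-passing pattern; the variable names are made up for illustration:]

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready{false};

    int main() {
        std::thread producer([] {
            payload = 42;                                   // plain write
            ready.store(true, std::memory_order_release);   // publish
        });
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) { }  // wait for publish
            assert(payload == 42);   // guaranteed visible after the acquire
        });
        producer.join();
        consumer.join();
    }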

(At 10:55)

Timo: Those were the advantages that we came up with. We worked very hard, we left out all the disadvantages... Do you have an idea what we could add to our table?

Anthony: Hm. I think, to a large extent, one of the key benefits of using C++ for your parallelization is that it's C++. And so it is inherently close to the metal, and you get the full scope of optimizations that you get for serial code. That is now available for your parallel code, too. If, at the other extreme, you are writing parallel code in Python, then you miss out on a lot of those optimizations, even if the code may to some extent be quicker to write in some aspects.

Timo: So the closeness to the bare metal of C++ enables you to use all of the optimizations.

Anthony: Yes. For example, Microsoft is trying very hard with the optimizing compiler for C#, but it's nowhere near the scope of their C++ optimizer. And so even if it is compiling to native code rather than bytecode, like the JIT compiler does, it doesn't reach the same level of performance for actually CPU-intensive code.

Timo: Yes. If you think about how many decades of manpower have been put into the GCC or Clang optimizers, it is mind-boggling.

(At 12:55)

Timo: Interestingly, we have a wish list. We wrote down advantages and a wish list. The async/await pattern from C#. Do you know that one?

Anthony: Yes.

Timo: Apparently await is a keyword and does magic chaining of future results. People wanted that here, badly.

Anthony: With the coroutines TS! Which currently has an implementation in Visual Studio, and as of last week clang has an implementation.

Timo: Of the await pattern?

Anthony: Yes, of the coroutines TS. It gives you the bare-bones facilities to write that.

Timo: Oh, wow.

Anthony: So co_await is the keyword you use instead of await, and then the library writer has to provide support to make futures work with your coroutines and run them asynchronously under the covers.

Timo: Cool, we have to check that out, because with this pattern you can write the async calls much more easily, right? Otherwise, for example in Boost.Asio, you get this huge cascade of lambdas going down all the way.

Other: Just like Javascript callbacks.

Timo: Yes, they are Javascript callbacks.

Anthony: Yes, it is exactly the same: callback hell.

Timo: With this co_await and support, that would be really cool, I have to check it out.

Other: I think there have been some macro and TMP approaches for coroutines, but they are a bit ugly and don't work in all cases, if I recall correctly.

Anthony: Yes, so: VS 2017 does it, and clang 5 on trunk.

Other: I think I saw a demonstration in the VS code, where somebody implemented these.

Anthony: So Gor Nishanov has done presentations on it, particularly at CppCon last year. He called them the magic disappearing coroutines: about how the code simplifies down, so that there is actually no overhead at all and it looks just like straight-line code. But there is also the potential for making that work with async-aided parallelism.

Timo: Wow, cool. I definitely have to check that one out. I haven't actually used too many asynchronous calls yet. I'm not quite sure why. I always go down to the usual std::thread versions. That's just old me coding.
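
[A bare-bones illustration of the co_await syntax discussed above, written against the C++20 headers that grew out of the coroutines TS; the fire_and_forget type is a made-up minimal coroutine return type, not library code -- a real future library would hand back an awaitable that resumes the coroutine when the result is ready:]

    #include <coroutine>
    #include <iostream>

    struct fire_and_forget {                 // minimal coroutine return type
        struct promise_type {
            fire_and_forget get_return_object() { return {}; }
            std::suspend_never initial_suspend() noexcept { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
    };

    fire_and_forget demo() {
        // std::suspend_never completes immediately, so execution just
        // continues; an asynchronous awaitable would suspend here instead.
        co_await std::suspend_never{};
        std::cout << "resumed\n";
    }

    int main() { demo(); }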

(At 16:40)

Timo: We have more wishes. But since you already granted the first one, maybe we are going to get lucky with the second one as well. We were talking a lot about SIMD parallelization, and that writing it in C++ is currently like writing assembler code -- at least in the old way of writing SIMD code. People usually also give all of the SIMD variables register names, right? We wish for some nicer way to write SIMD programs using TMP and generative programming. I think there is Boost.SIMD, which attempts this.

Anthony: Yes, there is.

Other: I mentioned a library. I am not sure of the name of the developer [maybe Agner Fog or Tuomas Tonteri?]; he wrote a huge manual on optimization in C++ and assembler. He also wrote a vector library, which you can use to write kind of high-level code with SIMD registers, where you have a data type like ivec4 for integers, with which you can do simple arithmetic and also more complex operations.

Other: I think there are a few libraries that provide some data structures and have backends for SIMD implementations, which take some of the problems away from the normal developer. And having that more easily accessible would most likely help.

Anthony: Yes, there are active proposals before the standard committee about having SIMD support.

Timo: So some kind of library. There is apparently lots of work on this, but no standardization?

Anthony: Oh yes. Lots of discussion, but nothing beyond the papers yet, nothing in the actual standard.
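
[To illustrate the "writing assembler" feeling of raw intrinsics versus the thin value types that wrapper libraries provide: the vec4f wrapper below is a made-up minimal sketch, not the actual Boost.SIMD or vector class library API; x86 with SSE is assumed:]

    #include <xmmintrin.h>   // SSE intrinsics

    // Raw intrinsics: every operation is spelled out by hand.
    void add_raw(float* out, const float* a, const float* b) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));
    }

    // A thin wrapper gives the same code ordinary arithmetic syntax.
    struct vec4f {
        __m128 v;
        explicit vec4f(const float* p) : v(_mm_loadu_ps(p)) {}
        vec4f(__m128 x) : v(x) {}
        friend vec4f operator+(vec4f a, vec4f b) { return {_mm_add_ps(a.v, b.v)}; }
        void store(float* p) const { _mm_storeu_ps(p, v); }
    };

    void add_wrapped(float* out, const float* a, const float* b) {
        (vec4f(a) + vec4f(b)).store(out);
    }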

(At 20:10)

Timo: And then we were asking: what kind of other parallelizations would one want to have? Someone said: A parallelization like shells provide. So shell pipes actually are a very convenient way to express how to plug one program into another, right?

Anthony: Yes.

Timo: The only interface is a pipe stream. This is a really interesting idea; I wouldn't have come up with that. But of course, these processes run in parallel and independently.

Anthony: Yes, there is a proposal for that as well.

Other: I think it is coming from Java 8; they have implemented some streaming syntax.

Anthony: There are various things called channels and stuff. So, there is a concrete paper.

Timo: Great, we are crossing off every item on our wish list.

Other: Ok, so these pipelines can be used in parallel as well. So the idea is having buffers in between, which are like a producer-consumer queue, filled and emptied -- or what is the basic approach?

Timo: Oh nice. So that is what I basically told you: We want to have multiple processes that talk to each other via a very simple interface, and using these, you can chain functions -- these are functions right? Via some kind of queuing input and output streams.

Other: So this is basically also this parallelism for functional programming.

Timo: Right, so this is also our wish list for parallelization of functional programming in C++. So we basically have a chain of functions, and if they branch out, you can run multiple of these in parallel. So taking the idea of functional programming, which is really popular, right?

Anthony: I just posted a link to the SIMD paper as well. About the N3534 paper: it is all about pipelines, and the leading example is Unix pipelines, just like you described it.

Timo: Nice.

Anthony: These are proposals, so they are not in the standard yet. But there is a lot of discussion in the committee about them. These are directions that people want to go. So you are not the first to come up with these ideas, but because you are not alone and people in the committee are actually discussing it, there is a reasonable chance that you will see something along these lines come out.
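
[A rough sketch of the producer-consumer queue idea behind such pipelines; all names here are hypothetical, and a real pipeline library or proposal would add bounded capacity, batching, and better shutdown handling:]

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    template <typename T>
    class channel {                          // one unbounded pipeline stage
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<T> pop() {             // empty optional == end of stream
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    int main() {
        channel<int> ch;
        std::thread producer([&] { for (int i = 0; i < 5; ++i) ch.push(i); ch.close(); });
        std::thread consumer([&] { while (auto v = ch.pop()) std::cout << *v << "\n"; });
        producer.join();
        consumer.join();
    }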

Timo: Wow -- these are so many topics. So parallelization has so many things in its scope. The topic is huge and difficult.

Anthony: Yes. Everyone comes up with new libraries and facilities to try to make things easier. Paul E. McKenney from IBM, who works on the Linux kernel, wrote an ebook called "Is Parallel Programming Hard, And, If So, What Can You Do About It?". There's a whole lot of patterns in there on how to deal with various issues.

Timo: There was also a book called something like "parallel programming patterns" [maybe Structured Parallel Programming: Patterns for Efficient Computation?], as an equivalent to software design patterns. I think it was also the basis for the Intel Threading Building Blocks (TBB) library (which apparently did not make an appearance on our table yet, but of course TBB is very important). The TBB library tries out all of these parallelization patterns. But parallelization in C++ is of course much more than that, and much more difficult and complex, right?

Anthony: Yes, TBB is an important library. And yes, there is a lot. But it all depends on how you are structuring your program, as to which are the best libraries and things to use, and how you want to structure and what support you need.

Timo: That's also the only disadvantage that I wrote down: The huge amount of choice you have.

Anthony: Yes, there isn't one standard way to do it.

Timo: That is always one of the critique points of C++: that you have so many ways to do something. That makes people unsure. We have now filled one table with all the support that C++ offers for parallelization, and it is still difficult.

Anthony: Parallelism requires more thought and more care.

(At 28:00)

Other: Did we talk at all about the whole topic of synchronization? Because I am not sure at which point these atomics and mutexes appeared in the standard. They also provide us with the tools I know mostly from other languages, which had already implemented them. I think some of the features of C++ make it really nice and easy to use them. For example, in Java we had these synchronized blocks where we had to explicitly state which part is synchronized, and in C++ we can do basically the same with lock_guard, which is scoped by RAII. I think all these language features also allow us to get a somewhat safer implementation than just locking a mutex and explicitly unlocking it, which can sometimes lead to difficult race conditions or difficult deadlocks.
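
[A short illustration of the RAII locking mentioned here, with made-up names; the lock_guard releases the mutex automatically when the scope ends, even if an exception is thrown:]

    #include <mutex>
    #include <vector>

    std::mutex m;
    std::vector<int> shared_data;

    void append(int value) {
        std::lock_guard<std::mutex> lock(m);   // locks here...
        shared_data.push_back(value);
    }                                          // ...and unlocks here, exception-safe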

Timo: I remember, a long time ago, Java boasting of being one of the few languages supporting threads and having synchronized functions. It is actually a good contrast to what we have in C++, I think. Even though we have a full table, do we have all of the aspects that parallelization has? I think we have at least one library for each aspect, for the different facets that parallelization encompasses, right?

Anthony: Yes.

Other: But we have MPI, which can handle the whole distributed-memory situation, or multi-process parallelization, but we haven't written down any solution that supports this natively in C++ -- especially message passing.

Anthony: Well, I thought it's part of a standard.

Other: You mean for interprocess communication?

Other: Yes, like not just passing around pointers or blocks of data, but passing around things with semantics.

Other: But there are C libraries for message passing and interprocess communication.

Other: But again, these C libraries don't use the huge type system of C++.

Other: Yes, but you lose that as soon as you cross process boundaries.

Other: You don't if you use something else.

Anthony: [Sends https://github.com/STEllAR-GROUP/hpx about HPX]

Timo: Oh, the HPX.

Anthony: Have you used HPX?

Timo: I have heard of it.

Anthony: Ok. There was an implementation of the parallel STL in it, which is good. But they also supposedly support multiprocess parallelism, seamlessly. So that takes on a lot of what MPI might cover; hopefully MPI's use cases are covered within HPX, too. Personally, I haven't tried HPX's multiprocess facilities; I have only used it in-process, with threading. But it is supposed to work, and they work hard on that. It is a high-performance computing team; they use it on big cluster systems.

Timo: I have come across this before but haven't tried it. Supporting all of the high-performance computers is really difficult. That is much of what I do: thinking about how to program large computers and how to have an expressive language. And how to implement distributed algorithms that will do what the user wants, and chain these together.

Anthony: Yes.

(At 33:15)

Timo: So: did we miss some things?

Anthony: Yes. I saw on your picture that you had a little reference next to futures, which said ".then". That is in the concurrency TS, which you haven't written down there.

Timo: The extensions to futures, right? Yes, .then of futures is definitely needed, and also a correct futures implementation. I don't know if you know about the bugs that are in there.

Anthony: Yes.
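
[A hedged example of the .then continuation style from the concurrency TS, written here with Boost.Thread's futures, which already shipped a .then extension at the time; the BOOST_THREAD_VERSION macro is needed to enable it, and details such as the continuation's launch policy differ between implementations:]

    #define BOOST_THREAD_VERSION 4           // enables future continuations
    #include <boost/thread/future.hpp>
    #include <iostream>

    int main() {
        boost::future<int> f = boost::async([] { return 6 * 7; });
        // Attach a continuation instead of blocking with get():
        boost::future<void> done = f.then([](boost::future<int> r) {
            std::cout << "result: " << r.get() << "\n";
        });
        done.get();                          // wait for the whole chain
    }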

Timo: Really difficult. Yes. And I am really looking forward to the async await, or co_await.

Anthony: Yes, co_await was the keyword for the coroutines. Whether co_await can already be used depends on the function you are using it with; whether it is just a single-threaded coroutine or whether it has got an asynchronous call under the hood.

(At 34:30)

Other: Are you aware of any new parallel developments in the C++ world that are coming in the near future?

Anthony: Well. So the C++17 support is coming, with the parallel STL. I know that the Microsoft team is actively working on it, and so is the team behind HPX at the Stellar Group -- in case you haven't yet got a C++17 compiler with more bits than shared mutexes and such. So there are a few bits coming down there. Then there is the well-known discussion: the executor proposal, which covers things like thread pools, how to build your own custom thread pool, and in general specifying where to run your code [implementations: https://github.com/executors/issaquah_2016, https://github.com/chriskohlhoff/executors]. How to specify your tasks, and then you use an executor to specify where those tasks are run.

Timo: It is definitely needed because in my observation, there are now many libraries that support parallelization, but none of them work together, right?

Anthony: Yes. But if you use Boost.Asio, then that has a set of executors in there, the strands and the thread_pools and that sort of stuff.

Timo: Yes, I use that a lot.
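
[A small sketch of the executor style mentioned here, using Boost.Asio's thread pool; assuming a reasonably recent Boost, where boost::asio::post submits work to the pool's executor:]

    #include <boost/asio/post.hpp>
    #include <boost/asio/thread_pool.hpp>
    #include <iostream>

    int main() {
        boost::asio::thread_pool pool(4);        // four worker threads
        for (int i = 0; i < 8; ++i)
            boost::asio::post(pool, [i] {        // run on the pool's executor
                std::cout << "task " << i << "\n";
            });
        pool.join();                             // wait for all posted work
    }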

Anthony: The idea is then to extend that so it covers more options as well.

Timo: Oh, cool.

Anthony: And how you provide all the options; I think the current proposal ends up with you having about 50 customization points, which makes it a bit daunting: Which things do I use? And if I am trying to use it, which one do I need to implement? It is quite complex trying to get a framework that suits many people.

Other: Yes. And I think thread pools is an important missing piece, because otherwise, if you are too successful with your parallelization, you overcommit on your processor, and it is a real danger that I have seen with big applications. Like you have two independent parts that are parallelized and both try to get all the resources.

Timo: Yes. And one part uses OpenMP parallelization, the other Boost.Thread, and the third, I don't know, all different kinds of threads. And then it gets really daunting. So, in my feeling, even though the table is covered, there are so many facets of parallelization with C++. Since we are so close to the metal, you can argue for wanting to do all of these things. If you only have a high-level abstraction, then you can't go down deep into these things, right?

Anthony: Yes.

Timo: Even though there are so many facets, I think many of them are really useful for what you need. And you have all of them. Other languages don't have the wealth of parallelization possibilities that we have.

Anthony: That is certainly true.

Timo: That is definitely also one of the advantages that C++ has for parallelization, that we are missing on our table.

(At 38:20)

Timo: But it remains really difficult to write good and fast parallel code, but we have no other choice, right?

Anthony: Yes. There is the challenge. I know that the Microsoft team is working on the parallel STL because Billy on the team would say: "The implementations I come up with at the start are actually slower than the serial code until they're processing containers with many elements, because of the extra complexity of doing parallelization." So trying to get it right is a real challenge.

Timo: So the fastest processors with the fastest clock rates run at 4 GHz, right?

Anthony: Yes.

Timo: And we are not going to get 8 GHz. Well, maybe.

Anthony: Probably not. Not until they come up with better cooling.

Timo: Yes, but they may, right? They are really good at pushing it even further and further. But there is a limit. And then the only other way to get performance increase is parallelization, right?

Anthony: Yes.

Timo: There is no way around it. And even though we have so many possibilities now, it remains difficult. I guess I really have to get your book, and we are probably going to have a course for students on C++ performance coding, and parallelization -- especially correctly parallelizing things -- is going to be a big part of it.

Anthony: Ok. Then you should get my book. There is a second edition that is currently only available as pdf because it is not finished yet.

Timo: Ok, cool. Any questions left?

Other: Well, I missed the beginning, I was at the Robert Ramey table. I am happy that the technical part was no problem, and you seem to have covered a lot of topics -- when I am looking at the table and what I have heard at the end. Great. So, we have recorded this, I am really curious and will listen to it. And I will get back to you and show it to you before I publish it, so you can check it.

Anthony: Ok. It was nice speaking to you.

Timo: Yes, definitely. It was nice speaking to you, thank you.

Anthony: Bye bye.