Imagine an awkward teenager in 1999 with two stacks of books beside his bed: biographies of famous physicists like Richard Feynman mixed with the history of the Manhattan Project, and business books about leadership, with stories of startups such as “Netscape Time” by Jim Clark. He’s installed Corel Linux on an ancient computer, and is trying to figure out what he wants to be when he grows up. You’d see him sometimes walking home from a used computer store carrying an IBM XT that he bought with his bus fare, just to take it apart. He was always soldering something, getting shocked, and dealing with the occasional short-circuit smoke emergency.
By 2001 this awkward teenager was in university, so excited to be studying science that he’d stay in the library until it closed, poring over chemistry and physics. His favorite spot in the library was a study carrel near the biographies of scientists, and he’d pick random books about string theory and grand unified field theories for inspiration as he tried to understand the cosmos. He’d come home and stay up all night installing another operating system on an ancient computer – this time Solaris for x86. Even more awkwardly, he’d talk in his sleep – apparently doing calculus, as his roommate joked he couldn’t do that awake.
That’s me. I wanted to do something really big and impactful, and I wanted to do it commercially. But I had no idea what it was yet.
Outside of my dorm room in the real world, the golden age of sequential clock frequency scaling continued. Software developers relied upon the “Free Lunch” era of processor performance improvements to transparently improve their applications and pay for the cost of their abstractions. Moore’s Law was eternal. If you needed more software performance, you simply bought a new CPU and plugged it in.
In 2002 my exotic interest in operating systems and used computers led me to purchase an Intel Itanium server from eBay. Here was an Intel CPU with amazing performance potential, and an architecture that looked really cool. But software applications did not get good performance on the Itanium. It had a C compiler, and you could compile software for it, but somehow the compiled code just wasn’t very fast.
Donald Knuth, a famous computer scientist, said “The Itanium approach…was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.” This was the first time I learned about the mismatch between software and hardware. I was fascinated by this problem. I wondered: if I figured out how to fix that software mismatch, how valuable would this be to Intel? I started to become very curious about computer hardware, and the mixture of books by my bed began to have more computer architecture manuals and guides on performance optimization, displacing the physics books.
In 2004, clock frequency scaling ended. There would be no 10GHz or 80GHz CPUs. The world had no choice but to embrace multi-core processors and parallel programming, even though no approach to parallel programming had ever succeeded in the mainstream.
Tim Mattson, Intel Principal Engineer and expert in parallel programming, said “We stand at the threshold of a many core world. The hardware community is ready to cross this threshold. The parallel software community is not.” John Hennessy, the computer architect who helped create RISC and today is the chairman of Alphabet, said “…when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced. …I would be panicked if I were in industry.” Donald Knuth said “I decided long ago to stick to what I know best. Other people understand parallel machines much better than I do; programmers should listen to them, not me, for guidance on how to deal with simultaneity.”
Here I found my purpose, both commercial and scientific. What if I could be the person who solved the parallel programming problem? Bob Kent, a computer science and physics professor at the University of Windsor who employed me part-time to talk to him about parallel programming, GRID computing, and physics in a smoky bar on campus, suggested to me: what if, instead of being a physicist, I was meant to be the person who helped the physicists doing nuclear fusion models in HPC with better parallel programming models?
The very foundation of computer science is sequential. The theory of computing, which is the “physics” of computing, is inherently sequential and hostile to parallel computing. Our programming languages are built upon that sequential theory. Parallelism was added to sequential languages in an ad hoc manner as an afterthought. Yet it was clear that the future was parallel, and computing needed something that had never been done before: an approach to parallel programming that was as easy as writing sequential code. This is what would build a valuable company, like the ones I read about as a teenager. What if I developed this amazing new approach and was able to keep the IP? What if I could restore transparent software scaling so that the same code, without any rewrites, would execute on any parallel computer? This is what would have saved the Itanium.
My intuition told me something must be wrong with the fundamental software model. There must be some reason for the disconnect between software and hardware. I wondered if this was like trying to build a nuclear reactor without understanding nuclear physics. But I was full of self-doubt. I’m not a theorist. I’m not a mathematician. Who was I to tackle this incredibly hard problem that the experts and industry had concluded was impossible? I hadn’t even finished my undergraduate studies in computer science. I hadn’t even written a single parallel program.
I committed myself to finding a solution to parallel programming. I was at the wrong university. There wasn’t enough theory here, and there were no answers in a software engineering curriculum. I needed to go somewhere that had more resources, and a deeper focus on research and theory, with a lot more mathematics. With a strong recommendation from Bob Kent, I transferred to the University of Toronto in 2004 and embraced computer science theory and mathematics in an effort to become the person with the background to tackle this challenge. I knew that I’d have to review foundational work to understand a better approach, and since the pioneers of computing were mathematicians, I’d have to do something I dreaded: study advanced mathematics. I was terrible at math! But I had to learn it to understand the writing of the computing pioneers and be able to understand what was wrong with software.
There was no major in parallel programming at the University of Toronto at the time. I had to “hack” my degree by choosing every course that could help me understand parallelism from every angle, both practical and theoretical. Every career and course choice was focused on improving my knowledge or skills related to the parallel programming problem. I took the hardest math courses at the university. I had to work incredibly hard, fail, and try again.
Thanks to a fellow student, Jeff Kingyens, who went on to the compiler team at NVIDIA, I began to learn GPGPU computing with CUDA around 2007. Here again was another processor, a GPU, that could not easily be programmed. More evidence of the software-hardware mismatch.
When Threading Building Blocks (TBB) emerged, I became part of its open source community and learned everything I could about it. I knew that I needed a mentor, so I reached out to Arch Robison, creator of TBB, for guidance. He strongly encouraged my quest into parallelism and GPU computing. He patiently answered my naive questions. Thank you, Arch, for your encouragement; it was very impactful.
When OpenCL emerged in 2008, I saw a platform that exposed the most general problem: heterogeneous computing. How can software target any arbitrary heterogeneous processor? Now we have a bunch of unprogrammable processors, all weird in different ways. How can we write software once and have it automatically work on any architecture?
At the time at the University of Toronto, nobody cared about GPUs outside of the engineering and physics departments; I was far too early. I did a project course with Greg Wilson, author of “Parallel Programming With C++”, and built a primitive load-balancing scheduler for CPUs and GPUs into TBB. I provided informal training workshops on GPU computing at the university, attended by some of Geoff Hinton’s students, and helped physicists and engineers with their parallel programming problems for fun. It was deeply satisfying to help the people trying to understand the cosmos with their computing problems. When I took my undergraduate course in machine learning, I wondered: could I use machine learning with OpenCL to build a hardware-software interface that automatically learned and adapted to any target hardware?
When it came time to graduate in 2012, GPUs and heterogeneous computing remained unknown and niche outside of engineering and physics. I thought about graduate school to focus on parallelization and heterogeneous computing, but there were two major problems. First, the work I wanted to do was too controversial, extremely interdisciplinary, and very likely to fail, meaning I would not get a PhD. I was advised it would take about 20 years of work before I was sufficiently secure in academia to take such risks. Second, the University of Toronto would own the IP of this work, and I’d have to disclose any breakthrough to the world, where it would quickly be taken up by larger, better-funded competitors. I really wanted to build something commercially valuable, and the IP was a dealbreaker. I needed a massive competitive edge over existing competitors, so that no matter how big they were, they couldn’t catch up, because we had nuclear physics and they did not.
There were whispers that Moore’s Law itself was going to end. I wondered what software would be like when hardware performance potential simply stopped and specialized exotic processors were the norm. My university education had given me a foundation in mathematics, and computer science theory. But I did not have the time during the intensity of my undergraduate studies to actually sit down and read and think about all of the foundational papers of computer science.
So I made a decision.
Instead of graduate school, I would make friends with the professors who would have supervised me, and continue to work on this problem by myself at the campus library. Every day, for years, I hauled a backpack of textbooks and academic papers to the library to truly understand the nature of parallel programming. I audited graduate courses, with permission, often helping explain to students how the research they were learning was being applied in industry. I knew that the post-Moore’s Law world was coming and the rise of heterogeneity was inevitable. I started to design a platform that could take code and automatically execute it on any processor from any vendor, using OpenCL as a hardware abstraction layer. Every design I considered had major flaws and introduced major complexity. I was starting to get a real feel for the complexity of the parallel programming problem.
I needed more mentorship. I had too many questions. So I set myself upon a quest to meet the people who wrote the OpenCL specification. OpenCL 2.0 was released provisionally, and the Khronos Group asked for feedback. I wrote out my extensive thoughts about how to fix OpenCL and published them. In 2014, thanks to Intel, I received a special invitation from Neil Trevett (NVIDIA) to join the OpenCL working group as an individual contributor. I was given a seat alongside every major hardware company to shape the future of heterogeneous computing. I was a neutral party: I had no processor to sell and no agenda. I could have free discussions with engineers at AMD, Intel, NVIDIA, Qualcomm, Imagination, and others. I saw the roadmap for the entire heterogeneous computing industry, and started to contribute. I wanted to fix OpenCL. But more failures followed. OpenCL is flawed because the underlying principles upon which it is based remain sequential. OpenCL, like CUDA, is an ad hoc extension of parallelism onto a fundamentally sequential model.
In 2016 I prototyped a “heterogeneous computing operating system” using OpenCL as a hardware abstraction layer, in collaboration with CMC Microsystems in Canada. This work led to a capabilities model in OpenCL 3.0, because it was necessary as a hardware abstraction layer for my work. I had a secondary agenda with OpenCL 3.0. It was a strategic foundation, designed for the machine learning model I was designing to automatically adapt software to arbitrary hardware. I knew that eventually I’d have to get hardware vendors to ship drivers for my platform, and convincing them to do so would be hard, so I guided the OpenCL specification to require it.
I was failing and learning and improving. But I knew that software was not enough. I had to learn how hardware works. I had to understand the physical reality that heterogeneous specifications like OpenCL and CUDA were attempting to target via their abstractions. I knew these abstractions were the source of the hardware-software mismatch, but how could I find a better abstraction if I didn’t understand what those abstractions were targeting? I reached out and met Paul Chow at the University of Toronto, a former postdoc of John Hennessy’s at Stanford who contributed to RISC in the 80s and is an expert in reconfigurable computing. Paul began to mentor me on how to build a valuable technology business, and on how hardware really works under the hood.
I met I-Cheng Chen, an AMD Fellow. I-Cheng was able to give me access to AMD GPUs I otherwise couldn’t afford for experimentation. He met me for coffee every Friday on campus, after my library studies, to answer my questions about how GPUs really work and how the semiconductor business works. He helped me learn all about chiplets and packaging technology, and what the world would be like beyond Moore’s Law. Thanks to I-Cheng I prototyped all of my ideas on AMD GPUs.
I had occasional lunches with Michael Wong, who helped create modern C++ and its parallel programming facilities, to help me understand memory consistency models.
I was invited to join the Fields Institute for Research in Mathematical Sciences as a visitor, which gave me an office and an opportunity to work with brilliant mathematicians to understand the theoretical limits of parallel programming. I wanted to understand whether all programs can benefit from execution on GPUs and parallel systems, or whether some applications will forever be sequential. This gave me the background in theoretical mathematics I needed to truly understand the work of the pioneers of computing.
I tried and failed to convince the startup community, and VCs, of the impending end of Moore’s Law and the critical need for my work in heterogeneity. I gave public talks with dire warnings about what would happen when compute performance simply stopped, and how the difficulty of programming heterogeneous accelerators threatened the entire industry. I was invited to join a startup accelerator, but was told there was no market for heterogeneity, no role for GPUs in machine learning, no problem to solve. I was thrown out of one accelerator because, when asked to disclose sensitive IP to a group of VCs and research scientists, I refused without an NDA to protect my startup. I applied for startup funding from the government, with a letter of support from both NVIDIA and AMD explaining the critical importance of my work to reduce energy use, and the application was rejected as being of insufficient quality for innovation funding. These failures taught me a lot about business, marketing, and execution. And through it all, I knew that the end of Moore’s Law was inevitable.
Throughout these failures, each teaching me important lessons, I remained committed to my mission of “software lives forever,” meaning that once an application is written it should continue to run efficiently no matter how hardware changes underneath. I had to find a way to decouple software logic from the hardware to prepare for the day that hardware would be in continuous flux.
I could have given up. I thought about it seriously. I sent out resumes sometimes, and interviewed to consider a traditional software engineering career. It was very tempting to abandon this idea of finding a solution to heterogeneous computing. But by this point I had a very deep understanding of the entire compute stack, from the language, to the compiler, to the operating system, to the hardware and the underlying mathematical theory that powers all of it.
Then I realized something. I figured out why every approach to parallel programming and automatic parallelization had failed. I realized why my own efforts had failed.
Everything about how we write and execute software is based upon the underlying theory of Alan Turing and John von Neumann. Alan Turing gave us Turing Machines and Turing Completeness, an all-powerful computing abstraction for software that can conceptually solve any problem you pose. John von Neumann gave us the von Neumann architecture, the organization of a stored-program computer in which instructions and data share a single memory.
I asked myself a simple question: do programmers actually want to write an infinite loop? What if Turing Completeness was a mistake? What if our programming models are impractical because they are too powerful, letting us write programs that nobody actually wants to write? Programs with security bugs? Programs that take exponential time to sort a list?
What would happen if we threw away this foundation and built a new one? What if we built a new computing model that was not Turing Complete, and not von Neumann? What if we built it from the ground up to be analyzable, like SQL? Suddenly, things became very clear. If we could analyze code, we could predict its cost, understand its resource requirements, and determine how to map it to arbitrary accelerators. We could then use machine learning to automatically learn and adapt to arbitrary processors, giving us plug & play heterogeneity: a system that transparently and automatically adapts any software to any processor, no matter how exotic, without source changes.
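To make the idea concrete, here is a toy illustration of my own (not the Toronto model itself, whose details are proprietary): a tiny expression language that is intentionally not Turing Complete. Because it has no loops or recursion, every program in it provably terminates, and the exact cost of any program can be computed without running it. The per-operation costs below are made-up numbers, not real hardware measurements.

```python
from dataclasses import dataclass

# A tiny, intentionally Turing-incomplete expression language:
# no loops, no recursion, so every program terminates and its
# cost is statically computable -- the property that Turing
# Complete languages cannot offer in general (halting problem).

@dataclass
class Const:
    value: float

@dataclass
class BinOp:
    op: str      # "+" or "*"
    left: object
    right: object

# Hypothetical per-operation costs (e.g., cycles on some target).
OP_COST = {"+": 1, "*": 3}

def static_cost(expr) -> int:
    """Return the exact operation cost of evaluating `expr`,
    without running it."""
    if isinstance(expr, Const):
        return 0
    return OP_COST[expr.op] + static_cost(expr.left) + static_cost(expr.right)

def evaluate(expr):
    if isinstance(expr, Const):
        return expr.value
    l, r = evaluate(expr.left), evaluate(expr.right)
    return l + r if expr.op == "+" else l * r

# (2 + 3) * 4 costs one "+" and one "*": 1 + 3 = 4 cost units.
program = BinOp("*", BinOp("+", Const(2), Const(3)), Const(4))
print(static_cost(program))  # 4
print(evaluate(program))     # 20
```

In a Turing Complete language, `static_cost` is impossible to write for arbitrary input; restrict the language, and cost analysis becomes a simple walk over the program. Swapping the `OP_COST` table per target is a crude stand-in for learning a cost model per processor.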
I began to design the Toronto model of computing, named after the University of Toronto. The Toronto model is a theoretical model of a computer. Software, instead of targeting a von Neumann machine with Turing Completeness, would target an intentionally Turing Incomplete model built for the practical applications that people actually write. Then came a moment of perfect clarity. This was why every approach to automatic parallelization had failed. This is why software is inherently insecure. This is why it’s so hard to write correct software.
A theoretical model might be interesting to publish academically, but this was not my mission. This was my competitive edge for a company. I then began to design a new programming language based on this model, a new compiler, a new operating system, and a new hardware abstraction layer. It would absolutely be cost-prohibitive to rebuild the entire compute stack with the methods we have today. But I had my training in machine learning, and I looked for clever ways to reduce the amount of code that had to be written. It became clear that we needed an adaptive software-hardware interface that could continuously learn how hardware behaves. This is what I designed to sit on top of OpenCL 3.0.
Once I had a solution, I needed help to critique it. I brought technical advisors into the company to find issues with my work. The list of technical advisors kept growing as I actively looked for problems with my approach, and then corrected them. I brought business advisors into the company to help me design a business, explore revenue model ideas, and strategies.
I began to work with Shane Peelar, an advisor who worked nights and weekends after his day job, to prove the model really worked. In Manhattan Project terms, we had just figured out fission. But did it work? We had to build a nuclear pile to prove the theory held up in reality. We implemented a simple programming language, assembly code, and a simulated processor based on the principles of the Toronto model. The first time I ran it, I cannot describe how it felt. I felt like I had just seen an atom being split. It worked! We could now, instantly and statically, analyze an arbitrary piece of code and output how long it would take to execute. I was taught this was impossible, but there it was, running. Automatic parallelization followed. For the first time, we could write sequential code and automatically parallelize it for any target hardware. The same code can run on a GPU, a CPU, or any processor that exists now or could possibly exist in the future, no matter how exotic. This is what would have saved the Itanium.
We now have a full system architecture based upon the Toronto model. We know how to build a new operating system that can seamlessly run the same code on a single node, a compute cluster, or distributed system without modification. We know how to build the compiler. We know how to modify programming languages to make them analyzable. We know how to build an optimizing compiler that automatically learns and adapts without human intervention to target any processor. We’re ready to implement the full vision with a proven theory and a prototype that validates it works.
After a long quest of deep research, I have emerged from my time away to find that today’s world is still stuck in the thinking of Turing Completeness and the von Neumann architecture. CUDA, ROCm, and OpenCL are all based on these principles. I really wondered whether the world would accept a fundamentally new approach to software.
Today the world is facing an energy crisis due to AI’s insatiable demand for compute. Moore’s Law is dead. The world needs a solution right now. The computing apocalypse for which I have been preparing all my life is here. The world has crashed into the “Heterogeneity Wall” and is now asking itself the same questions that I asked in 2010. How do we program all of these different processors?
To usher in a new era in computing, and to prove that we have fundamentally solved the challenge of automatic parallelization, I needed help. Nobody was going to believe me that automatic parallelization is possible, especially after I spent so long building the model in secrecy and intentional obscurity.
I am no longer alone or isolated in bringing this technology to the world.
Maurice Herlihy’s theories of parallelism helped to create our modern world of distributed systems. He is a leading expert in parallel programming and a three-time winner of the Dijkstra Prize. Maurice has not only validated this work; he is coming with me to advocate, when we meet with potential investors, that automatic parallelization is now possible. Maurice validates that the theory upon which the Toronto model is based is sound.
Michael Wong is an expert in parallel programming, and helped create modern parallel technology including SYCL, OpenMP, and C++. Mike has joined YetiWare as CTO, and is leading our discussions with potential investors and sponsors of the project.
Tim Mattson, who helped create OpenCL and OpenMP, and a hero from my youth whose writings and talks inspired me on this path, is helping me explain this technology clearly and concisely.
Changing the world is very hard, and I needed help from a business leader who has done it time and time again. Nicholas Donofrio, a former IBM executive who was at the forefront of innovation at IBM and shaped our modern technology world, motivated us to keep going. We have shown the demo to Nick, and he validated that there is an immense business opportunity here, one that reminded him of the IBM mainframe in its ability to keep software running. Customers will save money by using our platform instead of continuously rewriting their code. My life’s mission of “software lives forever” really resonated with Nick. I am deeply honored by his support of this mission, and his help with the business opportunity.
It is no longer time for more research. Now it is the time for us to implement the system and deploy it. Now is the time to build software for a post-Moore’s Law world. We have pivoted to pure engineering. The long summer days are over for unicorns. Now is the harsh winter of diminishing returns of hardware, and escalating complexity of software due to the apocalypse of heterogeneity. Now is the time for a yeti.
This is the launch of YetiWare.