The COPIOUS Engineering team spends a lot of time with new languages and technologies, and is always concerned with security from step one. Brian Shirai is on our partner EngineYard's team, and maintains a programming language platform called Rubinius. He spoke at our last TechTalk, and presented solid language design patterns to COPIANS and members of the local open source communities here in Portland.
Some programmers are content to craft their creations with the languages they are given. They see the language as a tool and become expert at wielding it to perform their duties. For others, designing the language itself has tremendous allure. Some will discover Lisp and spend the rest of their days in a zen-like state transcending syntax while others will battle the syntax dragon to their last breath.
Whether your ambition is to create the next great programming language or merely to gain deeper insight into the various languages you encounter, designing your own language is an eminently practical endeavor. That's because computer security and languages are intimately related. If you've been anywhere around Rails lately, you have seen security issues demand a lot of attention, and a lot of developer time spent updating applications.
If you only watch one tech video this year, please let it be this one: 28c3: The Science of Insecurity. I'll summarize the main points to relate it back to the idea of designing a programming language:
- Language grammars can be divided into classes.
- Two of those classes have a very good property: For any input, you can determine whether it is well-formed.
- Those two classes are the regular languages and the unambiguous context-free languages.
- Security is a matter of ensuring that all protocols for communicating between components have the property of being verifiable in finite time.
- The corollary is that if a protocol cannot be verified in finite time, it will take you an infinite amount of time to secure your system.
- Consequently, design your protocols with regular or unambiguous context-free grammars and generate the recognizers mechanically from the grammars.
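To make the last point concrete, here is a minimal sketch in Ruby of a recognizer for a regular protocol. The wire format (`GET <key>` and `SET <key> <value>`, each ending in a newline) and the names `recognize`, `KEY`, and `VALUE` are invented for illustration; they are not from the video. The idea is that the whole message is checked against the grammar before any other code touches it, and anything that is not well-formed is rejected outright.

```ruby
# Hypothetical wire format, invented for this example:
#   GET <key>\n
#   SET <key> <value>\n
# Both forms are regular, so an anchored regular expression is a
# complete recognizer for each of them.
KEY   = /[a-z]{1,16}/
VALUE = /[0-9]{1,9}/
GET   = /\AGET (?<key>#{KEY})\n\z/
SET   = /\ASET (?<key>#{KEY}) (?<value>#{VALUE})\n\z/

# Returns a Hash describing the message, or nil if the input is not
# a well-formed message. Nothing downstream sees unrecognized input.
def recognize(input)
  if (m = GET.match(input))
    { command: :get, key: m[:key] }
  elsif (m = SET.match(input))
    { command: :set, key: m[:key], value: Integer(m[:value], 10) }
  end
end

recognize("SET answer 42\n")     # => {:command=>:set, :key=>"answer", :value=>42}
recognize("GET answer; rm -rf\n") # => nil
```

The anchors `\A` and `\z` matter here: they force the entire input to match, so trailing or embedded extra data cannot slip past the recognizer.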
This video examines conventional wisdom about why programs are insecure. Conventional wisdom assigns security concerns to a multitude of factors. As the video highlights, when an explanation is too complicated, that is a good indicator we don't know the problem well. The explanation for security given in the video is one that is amenable to our tools of mathematics and scientific inquiry: secure-able protocols prevent machines or software from being put into unanticipated states.
That short digression into security is to emphasize that language design is a vital concern for all programmers. Designing a language isn't a fanciful hobbyist activity; it has important real-world application.
Implementing A Language
Whether you are designing a general purpose or a specialized language, you have many decisions to make about how to implement it. Specialized languages, often referred to as internal or external domain-specific languages (DSLs), are typically implemented with another high-level language. However, even a general purpose language can be implemented in a high-level language. Heist, a Scheme implementation by James Coglan, is written in Ruby. A common approach for implementing a language this way involves parsing your language text into an abstract syntax tree (AST) and then walking the nodes of the tree, evaluating them in terms of the host language. This is called an AST-walking interpreter.
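As a rough illustration of the AST-walking approach, here is a small sketch in Ruby. The node types (`Num`, `Var`, `BinOp`, `Let`) and the `evaluate` method are hypothetical names made up for this example, and the parsing step is skipped: the tree is built by hand.

```ruby
# AST node types for a tiny expression language (hypothetical names).
Num   = Struct.new(:value)
Var   = Struct.new(:name)
BinOp = Struct.new(:op, :left, :right)
Let   = Struct.new(:name, :value, :body)

# Walk the tree, evaluating each node in terms of the host language.
# env maps variable names to their values.
def evaluate(node, env = {})
  case node
  when Num
    node.value
  when Var
    env.fetch(node.name) { raise NameError, "unbound variable #{node.name}" }
  when BinOp
    left  = evaluate(node.left, env)
    right = evaluate(node.right, env)
    case node.op
    when :+ then left + right
    when :- then left - right
    when :* then left * right
    else raise ArgumentError, "unknown operator #{node.op}"
    end
  when Let
    # Evaluate the body in an environment extended with the new binding.
    evaluate(node.body, env.merge(node.name => evaluate(node.value, env)))
  else
    raise ArgumentError, "unknown node #{node.inspect}"
  end
end

# let x = 2 + 3 in x * 4
program = Let.new(:x,
                  BinOp.new(:+, Num.new(2), Num.new(3)),
                  BinOp.new(:*, Var.new(:x), Num.new(4)))
evaluate(program) # => 20
```

Notice that the interpreter borrows arithmetic, recursion, and error handling from Ruby. That is what makes this style quick to write, and also why it tends to be slower than compiling to instructions.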
A good beginner to intermediate reference for language implementation is Language Implementation Patterns by Terence Parr.
Instead of using another programming language to implement your language, you could target a computing machine. We are familiar with the idea of a CPU in a computer running programs, but there are also virtual machines that have instruction sets in a similar manner. To implement your language, you translate the syntax to a sequence of these instructions. The machine executes the instructions to run your program.
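Here is a minimal sketch of that idea in Ruby: a toy stack machine and a translator that turns nested expressions into its instructions. The opcodes (`:push`, `:add`, `:mul`) and the methods `compile` and `run` are invented for this example and are not the instruction set of any real virtual machine.

```ruby
# Translate a nested expression such as [:*, [:+, 2, 3], 4] into a
# flat sequence of instructions for a toy stack machine.
def compile(expr)
  case expr
  when Integer
    [[:push, expr]]
  when Array
    op, left, right = expr
    opcode = { :+ => :add, :* => :mul }.fetch(op)
    # Operands first, then the operator: postfix order suits a stack.
    compile(left) + compile(right) + [[opcode]]
  else
    raise ArgumentError, "cannot compile #{expr.inspect}"
  end
end

# Execute the instructions one at a time and return the top of the stack.
def run(instructions)
  stack = []
  instructions.each do |op, arg|
    case op
    when :push
      stack.push(arg)
    when :add
      b = stack.pop
      a = stack.pop
      stack.push(a + b)
    when :mul
      b = stack.pop
      a = stack.pop
      stack.push(a * b)
    else
      raise ArgumentError, "unknown instruction #{op}"
    end
  end
  stack.last
end

code = compile([:*, [:+, 2, 3], 4])
# => [[:push, 2], [:push, 3], [:add], [:push, 4], [:mul]]
run(code) # => 20
```

The translator and the machine only share the instruction format, so either side can be replaced without touching the other.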
A broad reference for many types of virtual machines is Virtual Machines: Versatile Platforms for Systems and Processes by Smith and Nair.
An important component of virtual machines is the memory management component or garbage collector. This is the preeminent reference on garbage collection technology: The Garbage Collection Handbook: The Art of Automatic Memory Management by Jones et al.
Some people come for the language design and stay for the virtual machine, garbage collector, or compiler technology. Each of these is a vast and fascinating area of computer science. However, if you want to get your programming language running as quickly as possible, you'll want to use the tools that an existing language platform can provide. Let's look at the Rubinius language platform.
The Rubinius Language Platform
Rubinius is an implementation of the Ruby programming language, as well as a flexible programming language platform. Ruby itself is a wonderful, powerful system language that you can use to implement your own language. Additionally, Rubinius exposes many essential features of a language platform in Ruby, making it very easy to get started implementing your language.
Rubinius provides a bytecode virtual machine, a modern generational garbage collector, and a just-in-time (JIT) compiler that converts bytecode into machine code using LLVM for improved performance. The bytecode compiler for Ruby is written entirely in Ruby and provides a rich example of how to target the Rubinius virtual machine.
Additionally, there are already a few intriguing languages that target Rubinius. One of these is Fancy by Christopher Bertels, which is a "general-purpose programming language inspired by Smalltalk, Ruby, Io and Erlang". Another is Atomy by Alex Suraci, which is a "programmable programming language" that is influenced by Common Lisp, Clojure, Haskell, Ruby, Slate and Atomo, Potion and Poison, and Erlang. If you have heard that good macro support requires all those Lisp parentheses, I urge you to check out Atomy.
With Rubinius, you can choose to implement your language using Ruby, much like Heist does, or you can target the bytecode machine directly, like Fancy and Atomy do. If you go the bytecode route, you can re-use much of what we have written for Ruby, like the AST and bytecode compiler architecture. For an example of this, refer to the Fancy source code. For many more examples of targeting Rubinius as a language platform, check out the Rubinius projects page.
Rubinius also provides features like a built-in source code debugger, profiler, memory analysis tools, and a component for monitoring the virtual machine. We are continually improving Rubinius and these tools to provide the best possible experience for implementing a programming language.
There has never been a better time to get started designing your own programming language. The resources referenced above have a wealth of information to help you. The two most essential tasks you will engage in are writing a parser for your language and writing some sort of compiler or translator to run it.
Parsing is an area of computer science that has been researched for decades. However, there is plenty of work left to do. In 2004, Bryan Ford proposed a new kind of grammar formalism called the parsing expression grammar (PEG). PEGs have the nice property that if a string is recognized, the parse tree representing it is unambiguous. Evan Phoenix (you may recognize him as the guy who created the Rubinius project) has written a full-featured PEG parser generator called kpeg. This is the parser generator used for the Atomy parser.
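To give a feel for how a PEG behaves, here is a small hand-written recognizer in Ruby that follows PEG rules directly. It does not use kpeg or its grammar syntax; the class name `TinyPeg` and the grammar in the comments are made up for this sketch. The key idea is ordered choice: alternatives are tried in order and the first one that succeeds wins, so a recognized string has exactly one tree.

```ruby
require 'strscan'

# Hand-written parser for a tiny PEG (hypothetical example grammar):
#   Start  <- Expr !.
#   Expr   <- Term ('+' Term)*
#   Term   <- Factor ('*' Factor)*
#   Factor <- Number / '(' Expr ')'
#   Number <- [0-9]+
class TinyPeg
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # Returns the tree as nested arrays, or nil if the text is not recognized.
  def parse
    tree = expr
    tree if tree && @s.eos?
  end

  private

  def expr
    repeat(:+, /\+/) { term }
  end

  def term
    repeat(:*, /\*/) { factor }
  end

  # Operand (operator Operand)*, building a left-leaning tree.
  def repeat(name, operator)
    left = yield
    return nil unless left
    loop do
      start = @s.pos
      right = @s.scan(operator) && yield
      unless right
        @s.pos = start # backtrack past a dangling operator
        break
      end
      left = [name, left, right]
    end
    left
  end

  # Ordered choice: try Number first, then the parenthesized form.
  def factor
    number || parenthesized
  end

  def number
    digits = @s.scan(/[0-9]+/)
    digits && digits.to_i
  end

  def parenthesized
    start = @s.pos
    if @s.scan(/\(/) && (inner = expr) && @s.scan(/\)/)
      inner
    else
      @s.pos = start
      nil
    end
  end
end

TinyPeg.new("2+3*4").parse   # => [:+, 2, [:*, 3, 4]]
TinyPeg.new("(2+3)*4").parse # => [:*, [:+, 2, 3], 4]
TinyPeg.new("2+").parse      # => nil
```

A parser generator like kpeg produces this kind of code for you from the grammar, which is less error-prone than writing it by hand.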
Writing a compiler is usually seen as an especially daunting task. That is both true and not so true. The Rubinius bytecode compiler itself, and the many projects targeting Rubinius, provide plenty of examples to learn from. For those who want to read more about compilers, an excellent reference is Engineering a Compiler, 2nd Edition by Cooper and Torczon.
I won't mislead you. Designing a programming language and implementing it can be a lot of work. I promise you, though, the journey is fascinating and immensely rewarding. So, what are you waiting for? If you have any questions, I am @brixen on Twitter or come hang out with us in the #rubinius IRC channel on freenode.net.