¶A map of the tree-sitter ecosystem
2021-06-14
The tree-sitter ecosystem is divided up across a large number of components, each in different repositories, which can be quite overwhelming at first. This post tries to provide a map of sorts.
¶Overview
Say you’re interested in the tree-sitter project, so you decide to check out the tree-sitter organization on GitHub, browsing through its repositories to determine how the ecosystem is structured. The list of repositories spills over onto a second page, and you see entries that seem redundant. Why is there both tree-sitter-python and py-tree-sitter? Are they competing with each other? Is one deprecated?
You might instead decide to check out the project homepage. The landing page lists (as of June 2021) over 40 different programming language parsers that various folks have implemented, as well as a handful of language bindings.
This, at least, points to an answer. The tree-sitter ecosystem is complicated because when we write a code analysis tool, we want to support different programming languages in two separate, orthogonal ways:
- First, we want to be able to parse source code implemented in different programming languages.
- Second, and possibly less obviously, we want to use tree-sitter in several different programming languages. You specifically are going to write your analysis tool in one language, but we (the tree-sitter developers) don’t know which one that is! We’ve tried to implement tree-sitter so that we don’t place any restrictions on which language you use.
That at least explains why “Python support” in tree-sitter might mean two different things. But why have we separated everything out into distinct repositories? The main reason is to make it as clear as possible that all of these pieces are truly independent of each other. There shouldn’t be any way for the Python language bindings to influence the design or release process of the Haskell bindings, for instance, nor of any of the language grammars.
True, it adds complexity to the ecosystem, but we’ve tried to get around this with careful naming conventions, and tree-sitter-specific tooling to make it easy to find and work with whatever pieces you need.
So, given the above, you will encounter all of the following on your journey:
¶Language parsers
You must have a tree-sitter grammar for each language that you want to parse. Each language grammar is typically implemented in a its own repository, named ‘tree-sitter-$LANGUAGE’.
There are some exceptions. For instance, the tree-sitter-javascript repository lets you parse JavaScript and JSX — although in this case, this is handled with a single grammar that treats “plain JavaScript” as a file that happens to not have any JSX expressions in it. Similarly, the tree-sitter-typescript repository lets you parse TypeScript and TSX, though in this case, they’re handled with distinct grammars. All of these grammars share enough structure, and are a coherent enough family of languages, that it would be overkill to separate them out further.
¶The tree-sitter runtime library
The generated parsers only contain some state tables describing the language being parsed. The “meat” of the parsing logic is implemented in the tree-sitter runtime library, which each parser depends on. This runtime library is also where tree-sitter’s query language is implemented.
The runtime library is implemented in the tree-sitter/tree-sitter repository on GitHub, under the lib/include and lib/src directories.
¶Language bindings
The runtime library and each generated parser are implemented in C. Assuming that you aren’t writing your analysis tool in C, you will need bindings for the language that you are using. This will use your language’s FFI mechanism to link in the tree-sitter C code and make it available using more idiomatic constructs.
The Rust and WASM bindings are considered “tier 1”, and are implemented directly in the tree-sitter/tree-sitter repository.
Rust binding implementation (tree-sitter/tree-sitter)
WASM binding implementation (tree-sitter/tree-sitter)
Other bindings (such as for Python or Haskell) are implemented in separate repositories, typically named ‘$LANGUAGE-tree-sitter’.
Haskell binding implementation (tree-sitter/haskell-tree-sitter)
¶Language parser bindings
Complicating things even more, you need both the runtime library and the generated parser for each language that you want to parse — and in particular, you need bindings for both! The language bindings described above only include the runtime library, since they can’t know in advance which languages you will want to parse. The bindings should include instructions for how to build and include your desired parsers.
For some language bindings, we can lean on the language’s package manager for this. For instance, for the Rust bindings, we publish packages to crates.io both for the language binding itself (the tree-sitter crate) and for most of the supported grammars (e.g. the tree-sitter-python crate). So if you are writing a tool, which is implemented in Rust, and which analyzes Python code, you would add both tree-sitter and tree-sitter-python to your Cargo.toml file. Wherever possible, we follow this approach for other language bindings, too.