Show HN: Globstar – Open-source static analysis toolkit

91 points by sanketsaurav a day ago

Hey HN! We’re Jai and Sanket, co-founders of DeepSource (YC W20). We're open-sourcing Globstar (https://github.com/DeepSourceCorp/globstar), a static analysis toolkit that lets you easily write and run custom code quality and security checkers in YAML [1] or Go [2].

After 5+ years of building AST-based static analyzers that process millions of lines of code daily at DeepSource, we kept hearing a common request from customers: "How do we write custom checks specific to our codebase?" AppSec and DevOps teams have a lot of learned anti-patterns and security rules they want to enforce across their orgs, and being able to do that without being a static analysis expert came up as an important need.

We initially built an internal framework using tree-sitter [3] for our proprietary infrastructure-as-code analyzers, which enabled us to rapidly create new checkers. We realized that making the framework open-source could solve this problem for everyone.

Our key insight was that writing checkers isn't the hard part anymore. Modern AI assistants like ChatGPT and Claude are excellent at generating tree-sitter queries with very high accuracy. We realized that tree-sitter's gnarly S-expression syntax isn’t a problem anymore (since the AI will be doing all the generation anyway), so we can instead focus on building a fast, flexible, and reliable checker runtime around it.

So instead of creating yet another DSL, we use tree-sitter's native query syntax. Yes, the expressions look more complex than simplified DSLs, but they give you direct access to your code's actual AST structure – which means your rules work exactly as you'd expect them to. When you need to debug a rule, you're working with the actual structure of your code, not an abstraction that might hide important details.

We've also designed Globstar to have a gradual learning curve: the YAML interface works well for simple checkers, and the Go interface can handle complex scenarios when you need features like cross-file analysis, scope resolution, data flow analysis, and context awareness. The Go API gives you direct access to tree-sitter bindings, so you can write arbitrarily complex checkers on day one.

Key features:

- Written in Go with native tree-sitter bindings, distributed as a single binary

- MIT-licensed

- Write all your checkers in a “.globstar” folder in your repo, in YAML or Go, and just run “globstar check” without any build steps

- Multi-language support through tree-sitter (20+ languages today)

We have a long way to go and a very exciting roadmap for Globstar, and we’d love to hear your feedback!

[1] https://globstar.dev/guides/writing-yaml-checker

[2] https://globstar.dev/guides/writing-go-checker

[3] https://tree-sitter.github.io/tree-sitter/

markrian 20 hours ago

Interesting! Do you have a page which compares globstar against other similar tools, like Semgrep, ast-grep, Comby, etc?

For instance, something like https://ast-grep.github.io/advanced/tool-comparison.html#com....

  • sanketsaurav 18 hours ago

    Not at the moment, but we'll put something up soon.

    We're focused on keeping Globstar lightweight, so a hosted runtime is not on the roadmap (although we'll add support for running Globstar checkers natively on our commercial product, DeepSource). You should be able to write any checkers in Globstar that you can write in the other tools you've listed.

    Our goal is to make it very easy to write these checkers — so we'd be optimizing the runtime and our Go API for that.

xxpor 21 hours ago

Another rule engine checker that doesn't support the language that needs this type of thing the most: C

In this case, it's inexplicable to me since tree-sitter supports C fine.

micksmix 14 hours ago

One of the main benefits of Semgrep is its unified DSL that works across all supported languages. In contrast, using the Go module "smacker/go-tree-sitter" can expose you to differences in s-expression outputs due to variations and changes in independent grammars.

I've seen grammars that are part of "smacker/go-tree-sitter" change their syntax between versions, which can lead to broken S-expressions. Semgrep solves that with its DSL, which also abstracts away those kinds of grammar changes.

I'm a bit concerned that tree-sitter S-expressions can become "write-only", since they rely on the reader also understanding the grammar for which they were generated.

For example, here's a semgrep rule for detecting a Jinja2 environment with autoescaping disabled:

  rules:
  - id: incorrect-autoescape-disabled
    patterns:
      - pattern: jinja2.Environment(... , autoescape=$VAL, ...)
      - pattern-not: jinja2.Environment(... , autoescape=True, ...)
      - pattern-not: jinja2.Environment(... , autoescape=jinja2.select_autoescape(...), ...)
      - focus-metavariable: $VAL

  
Now, compare it to the corresponding tree-sitter S-expression (generated by o3-mini-high):

  (
    call
      function: (attribute
                  object: (identifier) @module (#eq? @module "jinja2")
                  attribute: (identifier) @func (#eq? @func "Environment"))
      arguments: (argument_list
                    (_)*
                    (keyword_argument
                      name: (identifier) @key (#eq? @key "autoescape")
                      value: (_) @val
                        (#not-match @val "^True$")
                        (#not-match @val "^jinja2\\.select_autoescape\\("))
                    (_)*)
  ) @incorrect_autoescape

People can disagree, but I'm not sure that tree-sitter S-expressions are an upgrade over a DSL. I'm hoping I'm proven wrong ;-)

  • codelion 13 hours ago

    That's a really interesting breakdown of the DSL vs. S-expression approach. I can see your point about the potential fragility of relying directly on tree-sitter outputs, especially with grammar drift. It took me a while to wrap my head around the S-expression syntax when I first started using tree-sitter, so I appreciate the comparison to a more human-readable DSL like Semgrep's.

    The other benefit of a DSL like Semgrep's is that LLMs have become very good at generating it. See https://github.com/lambdasec/autogrep on how to automatically generate Semgrep rules from existing CVEs.

  • sanketsaurav 14 hours ago

    > One of the main benefits of Semgrep is its unified DSL that works across all supported languages.

    > People can disagree, but I'm not sure that tree-sitter S-expressions are an upgrade over a DSL.

    100% agree: a DSL is a better user experience for sure. But not inventing a new DSL and using tree-sitter natively is a deliberate choice on our part. We've addressed this directly and agree that the S-expressions are gnarly, but we're optimizing for a scenario where you wouldn't need to write them by hand anyway.

    It's a trade-off. We don't want to spend time inventing a DSL and port every language's idiosyncrasies to that DSL — we'd rather improve our runtime and add support for things that other tools don't support, or support only on a paid tier (like cross-file analysis — which you can do on Globstar today).

    • micksmix 11 hours ago

      That makes a lot of sense. I wish you the best of luck and will be happy to try it out as you continue to develop it!

micksmix 15 hours ago

This is a really interesting project!

I'd love to hear how this project differs from Bearer, which is also written in Go and based on tree-sitter? https://github.com/Bearer/bearer

Regardless, considering there is a large existing open-source collection of Semgrep rules, is there a way they can be adapted or transpiled to tree-sitter S-expressions so that they may be reused with Globstar?

  • sanketsaurav 14 hours ago

    Thanks!

    > I'd love to hear how this project differs from Bearer, which is also written in Go and based on tree-sitter? https://github.com/Bearer/bearer

    The primary difference is that we're optimizing for users to write their custom rules easily. We do plan to ship built-in checkers [1] so we cover at least OWASP Top 10 across all major programming languages. We're also truly open-source using the MIT license.

    > Regardless, considering there is a large existing open-source collection of Semgrep rules, is there a way they can be adapted or transpiled to tree-sitter S-expressions so that they may be reused with Globstar?

    I'm pretty sure there should be a way to make that work. We believe writing checkers (and having a long list of built-in checkers) will be a commodity in a world where AI can generate S-expressions (or tree-sitter node queries in Go) for any language with very high accuracy (which is where we have an advantage compared to tools that use a custom DSL). To that end, we're focused on improving the runtime itself so we can support complex use cases from our YAML and Go interfaces. If the community can help us port rules from other sources to our built-in checkers, we'd love that!

    [1] https://github.com/DeepSourceCorp/globstar/pulls

etyp 19 hours ago

I really love that static analyzers are pushing in this direction! I loved writing Clippy lints, and I think applying that "it's just code" approach to custom checks is a powerful idea. I worked on a static analysis product whose rules were horrible to write, so I don't blame the customers for not really wanting to write them.

Is there a general way to apply/remove/act on taint in Go checkers? I may not be digging deeply enough but it seems like the example just uses some `unsafeVars` map that is made with a magic `isUserInputSource` method. It's hard for me to immediately tell what the capabilities there are, I bet I'm missing a bit.

  • sanketsaurav 18 hours ago

    Thanks! We still have a long way to go and a pretty extensive roadmap.

    > Is there a general way to apply/remove/act on taint in Go checkers? I may not be digging deeply enough but it seems like the example just uses some `unsafeVars` map that is made with a magic `isUserInputSource` method. It's hard for me to immediately tell what the capabilities there are, I bet I'm missing a bit.

    Assuming you're looking at the guide [1], the `isUserInputSource` is just a partial example and not a magic method (we probably should have used a better example there).

    The AST for each node, along with the context, is exposed in the `analysis.Pass` object [2]. We don't have an example for taint analysis, but here's an example [3] of state tracking that can be used to achieve this. This is a little tedious at the moment and you'll have to do the heavy lifting in the Go code, but improving this is on our roadmap. We want to expose a lot more helpers to make things like taint analysis easier.
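    To make the state-tracking idea concrete, here is a minimal sketch of forward taint tracking in plain Go. It does not use Globstar's API: the `Stmt` type, the `Kind` constants, and `findTaintedSinks` are all hypothetical names operating on a hand-built toy statement list, where a real checker would derive the statements from the tree-sitter AST. It only handles straight-line code (no branches, loops, or function calls).

```go
package main

import "fmt"

// Kind classifies a statement in the toy IR (hypothetical, not a Globstar type).
type Kind int

const (
	Source   Kind = iota // Dst = <user input>
	Assign               // Dst = Src
	Sanitize             // Dst = sanitize(Src)
	Sink                 // sink(Src), e.g. a query execution
)

// Stmt is one straight-line statement with its source line number.
type Stmt struct {
	Kind Kind
	Dst  string
	Src  string
	Line int
}

// findTaintedSinks walks the statements in order, tracking which variables
// currently hold tainted data, and returns the lines of sinks that receive one.
func findTaintedSinks(stmts []Stmt) []int {
	tainted := map[string]bool{}
	var hits []int
	for _, s := range stmts {
		switch s.Kind {
		case Source:
			tainted[s.Dst] = true
		case Assign:
			// Overwriting Dst copies the taint state of Src (tainted or clean).
			tainted[s.Dst] = tainted[s.Src]
		case Sanitize:
			tainted[s.Dst] = false
		case Sink:
			if tainted[s.Src] {
				hits = append(hits, s.Line)
			}
		}
	}
	return hits
}

func main() {
	program := []Stmt{
		{Kind: Source, Dst: "name", Line: 1},
		{Kind: Assign, Dst: "q", Src: "name", Line: 2},
		{Kind: Sink, Src: "q", Line: 3},
		{Kind: Sanitize, Dst: "safe", Src: "name", Line: 4},
		{Kind: Sink, Src: "safe", Line: 5},
	}
	fmt.Println("tainted sinks at lines:", findTaintedSinks(program))
}
```

    The map plays the same role as the `unsafeVars` map in the guide: sources add entries, assignments propagate them, and sanitizers clear them.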

    Here's another idea [4] we're exploring to make the YAML interface more powerful: adding support for utilities (like entropy calculation) that you can call and perform a comparison.
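    As a rough illustration of what an entropy utility could compute, here is a minimal sketch of Shannon entropy over the bytes of a string literal. The function names, the minimum length of 16, and the 4.0 threshold are assumptions made for this example, not part of Globstar or of the linked issue.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per byte.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	entropy := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

// looksLikeSecret is a hypothetical comparison a checker might perform:
// flag long literals whose entropy meets an illustrative threshold.
func looksLikeSecret(literal string, threshold float64) bool {
	return len(literal) >= 16 && shannonEntropy(literal) >= threshold
}

func main() {
	literals := []string{
		"aaaaaaaaaaaaaaaa",
		"hello world, hello world",
		"q7Jx9ZpL2vKd8RmT4wYb",
	}
	for _, lit := range literals {
		fmt.Printf("%-28q entropy=%.2f secret=%v\n",
			lit, shannonEntropy(lit), looksLikeSecret(lit, 4.0))
	}
}
```

    A real rule would likely need to tune the threshold per language and literal type, since natural-language strings and hashes overlap more than this toy example suggests.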

    [1] https://globstar.dev/guides/writing-go-checker#_1-complex-pa...

    [2] https://globstar.dev/reference/checker-go#analysis-function

    [3] https://globstar.dev/reference/checker-go#state-tracking

    [4] https://github.com/DeepSourceCorp/globstar/issues/27

  • injuly 19 hours ago

    Flow analysis, especially propagation, is a hard problem to solve in the general case. IMO, the one tool that had the best, if language-specific, approach was Pyre – Facebook's type checker and static analyzer for Python.

pdimitar 21 hours ago

Wow this looks great. I will be giving it a go VerySoon™!

Looking forward to writing some enhanced linters.

codepathfinder 20 hours ago

Nothing comes closer to CodeQL!

If anyone is interested, please check out codepathfinder.dev, a truly open-source CodeQL alternative.

Feedback is appreciated!

  • injuly 19 hours ago

    Admirable effort :)

    But in its current state I don't think it actually replaces any of CodeQL's use cases. The most straightforward way to do what CodeQL does today would be to implement a flow analysis IR (say CFG+CallGraph) on top of tree-sitter.

    Even the QL grammar itself can be in tree-sitter.
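    For readers unfamiliar with the term, here is a minimal sketch of the CFG half of such an IR: basic blocks as integer IDs, edges as successor lists, and a reachability pass that reports dead blocks. The `CFG` type and its methods are hypothetical; in a real tool the blocks and edges would be built by walking tree-sitter nodes, while here they are added by hand. The call graph and any dataflow on top are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// CFG is a minimal control-flow graph: block ID -> successor block IDs.
type CFG struct {
	Entry int
	Succs map[int][]int
}

// AddEdge records a control-flow edge and registers both blocks.
func (g *CFG) AddEdge(from, to int) {
	if g.Succs == nil {
		g.Succs = map[int][]int{}
	}
	g.Succs[from] = append(g.Succs[from], to)
	if _, ok := g.Succs[to]; !ok {
		g.Succs[to] = nil
	}
}

// Reachable returns the sorted IDs of blocks reachable from Entry (BFS).
func (g *CFG) Reachable() []int {
	seen := map[int]bool{g.Entry: true}
	queue := []int{g.Entry}
	for len(queue) > 0 {
		b := queue[0]
		queue = queue[1:]
		for _, s := range g.Succs[b] {
			if !seen[s] {
				seen[s] = true
				queue = append(queue, s)
			}
		}
	}
	out := make([]int, 0, len(seen))
	for b := range seen {
		out = append(out, b)
	}
	sort.Ints(out)
	return out
}

// Unreachable returns the sorted IDs of registered blocks never reached from Entry.
func (g *CFG) Unreachable() []int {
	reached := map[int]bool{}
	for _, b := range g.Reachable() {
		reached[b] = true
	}
	var out []int
	for b := range g.Succs {
		if !reached[b] {
			out = append(out, b)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	g := &CFG{Entry: 0}
	g.AddEdge(0, 1) // if branch
	g.AddEdge(0, 2) // else branch
	g.AddEdge(1, 3) // both branches join
	g.AddEdge(2, 3)
	g.AddEdge(4, 3) // block 4 has no predecessor, e.g. code after a return
	fmt.Println("reachable:", g.Reachable())
	fmt.Println("unreachable:", g.Unreachable())
}
```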

    • codepathfinder 18 hours ago

      Thanks for the feedback. That's the exact plan :raised_hands:

      The current state of codepathfinder is less than 5% of what CodeQL has implemented. As a security engineer, I use it personally, and I'll keep adding features and closing the gap.

      Feel free to contribute ideas, feedback, or bug reports. Much appreciated, honestly!