Preserves: An Expressive Data Language

78 points by mpweiher 5 months ago

So `#` starts a line comment but `#t` is a boolean. Yeah, that's never gonna hurt anyone.

gotta have some examples on the landing before the fold, otherwise i have no reason to explore

mhitza 5 months ago

https://preserves.dev/TUTORIAL.html
Not a fan of the annotation syntax (which can double as comment syntax): having # followed by a space character behave differently feels strange.
porcoda 5 months ago

Tutorial link too hard to click? Seriously: these comments inevitably come up for most language related pages, and I don’t see how they don’t fall under the site guidelines of “Please don't post shallow dismissals”.
- yencabulator 5 months ago
  
  The comment specifically said "on the landing before the fold", so yours is the shallow dismissal. I agree with the comment; lead with the things we shall judge ye by.
- paddy_m 5 months ago
  
  I made a similar comment. I think they come up because people are genuinely interested in a project and trying to offer the creator a fresh perspective. When creating a project where you're the domain expert, it's so easy to get stuck in your own head, and then you start explaining the project to a newcomer by diving into deep details when they don't understand the starting point.
  Until a project has a lot of traction (think docker, react, django not uv or jq) it's very safe to assume that every visitor to your page doesn't understand the background.

boxed 5 months ago

> This is a good time to mention that even though from a semantic perspective sets and dictionaries do not carry information about the ordering of their elements.

Except they do in Python. It is extremely useful, surprisingly often.

tonyg 5 months ago

Python remembers order, and exposes it in its iterations, but doesn't use it in its equivalence over dictionaries (== semantics).
(ETA: What are you quoting there? I don't think that text appears on the Preserves site) (ETA2: Ah, it's the tutorial. Cool)
- sixdimensional 5 months ago
  
  Not sure about the previous post, but Python’s OrderedDict collection also guarantees order-sensitive equality checks. [1]
  Plain dict maintains insertion order but equality checks only check that the key/value pairs are the same. [2] [3]
  [1] https://docs.python.org/3/library/collections.html#:~:text=e...
  [2] https://docs.python.org/3/library/stdtypes.html#:~:text=dict...
  [3] https://docs.python.org/3/library/stdtypes.html#:~:text=dict...
- boxed 5 months ago
  
  Yea. Just pointing out that the order can be significant, so throwing it away isn't generally a good idea.
tmvphil 5 months ago

Only dictionaries. Sets are still unordered. As a person who was just burned.
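
A minimal Python sketch of the distinction drawn in this subthread, using only the standard library (Python 3.7 or later, where plain dicts preserve insertion order). The variable names are illustrative.

```python
from collections import OrderedDict

# Plain dicts remember insertion order and expose it when iterated...
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}
iteration_differs = list(a) != list(b)             # True: ["x", "y"] vs ["y", "x"]

# ...but == on plain dicts ignores that order.
plain_equal = (a == b)                             # True

# OrderedDict compared with another OrderedDict is order-sensitive.
ordered_equal = OrderedDict(a) == OrderedDict(b)   # False

# OrderedDict compared with a plain dict is order-insensitive again.
mixed_equal = OrderedDict(a) == b                  # True

# Sets make no ordering promise at all; equality is by membership only.
sets_equal = {1, 2, 3} == {3, 2, 1}                # True
```

So order is observable on dicts, but whether it is part of equality depends on which type you compare with.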

fjfaase 5 months ago

Is it true that Records are the same as Dictionaries, because the labels in the records can have any value?

Interesting how, on one hand, the size of SignedInteger is unlimited, but on the other hand there is a ByteString. A ByteString could also have been represented as a sequence of SignedInteger. I also wonder if it would not be better to have a Unicode character as an atomic unit and represent a string as a sequence of Unicode characters.

This makes me wonder whether this is a high-level data model or yet another data representation.

tonyg 5 months ago
No, a record is a tagged (sequence of) value(s).
```
  <tag v1 v2 v3>
```
If you put a single dictionary-valued "field" in a record, you get a variation with named fields
```
  <tag {
    field1: value1
    field2: value2
    field3: value3
  }>
```
Records have positional "fields" because of the Scheme heritage of the design.
--
Re bytestring -- yes there are some concessions to real machines/languages in there that aren't absolutely required. Other examples include booleans and strings, which could have been <true> and <false> and <string [65 66 67]> etc respectively.
There's a little more on this topic in footnote 2 on the "conventions" page: https://preserves.dev/conventions.html#fn:why-dictionaries
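
To make the record/dictionary distinction concrete, here is a rough Python model. It is a sketch under stated assumptions, not the API of any Preserves library: the `Record` class is hypothetical, and plain Python strings stand in for Preserves symbols and values.

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class Record:
    """Hypothetical model of a record: a label plus positional fields."""
    label: Any                 # stand-in for the tag
    fields: Tuple[Any, ...]    # positional, as in <tag v1 v2 v3>

# <tag v1 v2 v3>: three positional fields.
positional = Record("tag", ("v1", "v2", "v3"))

# <tag { field1: value1 ... }>: ONE positional field that happens to be a dictionary.
named = Record("tag", ({"field1": "value1",
                        "field2": "value2",
                        "field3": "value3"},))

field_counts = (len(positional.fields), len(named.fields))   # (3, 1)
lookup = named.fields[0]["field2"]                           # "value2"
```

In this model a dictionary is just one possible field value, so a record is not the same thing as a dictionary even when it is used to carry named fields.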
- carterschonwald 5 months ago
  
  Looks like a lot of excellent work!
  Are there any good examples of nontrivial schemas etc?
  
  tonyg 5 months ago
  
  Thanks. Yes there are: see sections 15, 16 and 17 of https://synit.org/book/, where Preserves, Preserves Schemas, and the Syndicated Actor Model make a reactive replacement system layer for Linux (essentially an alternative to systemd)

djoldman 5 months ago

Here is the ABNF:

https://preserves.dev/preserves-text.html

tonyg 5 months ago

Or, in "quick reference card" form: https://preserves.dev/cheatsheet.html
The syntax isn't the most interesting part though; the thing that distinguishes it from most other data languages out there is that it has semantics (= a rigorous definition of when values are equal and when they aren't). So you can use Preserves semantics with JSON syntax (a subset of Preserves' text syntax) as one way of getting actually-meaningful JSON.
Plus, comments (and other annotations) ;-)

yegle 5 months ago

Does Preserves have a page with a comparison to other common serialization languages?

As someone familiar with Protobuf, comparing Preserves vs Protobuf text format, here's my quick comparison between the two after reading through the tutorial:

- Preserves' Symbol is very close to Protobuf enums. But Symbol can contain characters like dash

- There doesn't seem to be an equivalent of Preserves' Record in Protobuf, but the tutorial's example of using <Unknown ...> to denote a missing <Date ...> can be simulated using the `oneof` field in Protobuf.

- Having to write #t/#f in Preserves is unfortunate. I guess this is the result of it being a schemaless serialization language and the potential parsing ambiguity with a Symbol?

- Protobuf has a way to annotate the schema and reuse the annotations at runtime, very similar to Preserves' annotations.

skybrian 5 months ago

Protobufs are designed to support schema evolution without explicit versioning. (All fields are optional so they can be added or dropped, provided field numbers aren’t reused.)
It looks like Preserves just uses version numbers in its schemas. On the other hand, you can read the data without a schema, similar to JSON.
- tonyg 5 months ago
  
  The version number is the schema language version, not the version of the collection of types described in the file.
  The schema language is extensible/evolvable in that pattern matching ignores extra entries in a sequence and extra key/value pairs in a dictionary. So you could have a "version 1" of a schema with
  Person = <person @name String> .
  and a "version 2" with
  Person = @v2 <person @name String @address Address> / @v1 <person @name String> .
  Then, Person.v2 from "version 2" would be parseable by Person from "version 1", and Person from "version 1" would parse using "version 2" as a Person.v1.
  The schema language is in production but the design is still a work in progress and I expect more changes before a 1.0 release of the schema language.
  (The schema language is completely separate from the preserves data model, by the way -- one could imagine other schema languages being used instead/as well)
  
  skybrian 5 months ago
  
  Thanks for the clarification! That sounds about as evolvable as JSON or any system that uses string keys (like HTTP headers).
  Protobufs have an extra level of indirection built in: code refers to fields using names, but numbers are sent on the wire. Without convenient access to field numbers, they can’t as easily be hard-coded. This also strongly encourages using the schema file for most tasks. With protobufs (or similar), any user-friendly editor will need a schema to make sense of the data.
  JSON-like systems and protobufs have opposite design goals: encouraging versus discouraging schemaless data access.
  
  tonyg 5 months ago
  
  There are no string keys in the Person example above. You could add some, though, or use numbers instead with the same host-language API:
  Person = <person @name String @address Address>
  as above, or
  Person = <person { @name "name": String @address "address": Address }>
  or
  Person = { @name 1: String @address 2: Address }
  etc. all produce the same host-language record, e.g. in TypeScript
  export type Person = { name: String, address: Address, };
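
The evolution rule tonyg describes in this subthread (patterns ignore extra entries in a sequence, and the "version 2" schema lists alternatives) can be sketched in Python roughly as follows. This is only an illustration under assumptions: records are modelled as plain tuples of `(label, field, ...)`, a string stands in for `Address`, and the function names are hypothetical rather than generated by any Preserves Schema tooling.

```python
from typing import Any, Dict, Optional, Tuple

Record = Tuple[Any, ...]   # hypothetical stand-in: (label, field1, field2, ...)

def parse_person_v1(value: Record) -> Optional[Dict[str, Any]]:
    """'Version 1' schema: Person = <person @name String> ."""
    # Requires the label and one String field; extra trailing fields are ignored.
    if len(value) >= 2 and value[0] == "person" and isinstance(value[1], str):
        return {"name": value[1]}
    return None

def parse_person_v2(value: Record) -> Optional[Dict[str, Any]]:
    """'Version 2' schema: try the @v2 alternative, then fall back to @v1."""
    if len(value) >= 3 and value[0] == "person" and isinstance(value[1], str):
        return {"variant": "v2", "name": value[1], "address": value[2]}
    v1 = parse_person_v1(value)
    if v1 is not None:
        return {"variant": "v1", **v1}
    return None

old_data = ("person", "Alice")                       # written against "version 1"
new_data = ("person", "Alice", "1 Example Street")   # written against "version 2"

old_reader_new_data = parse_person_v1(new_data)   # {"name": "Alice"}
new_reader_old_data = parse_person_v2(old_data)   # {"variant": "v1", "name": "Alice"}
```

The old reader silently drops the address, and the new reader tags old data as the v1 variant, which matches the two directions of compatibility described above.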

lionkor 5 months ago

Why/when/where would I need this?

paddy_m 5 months ago

I was going to ask the exact same question. The title makes it sound like something I might be interested in, then I visited the page and I have no idea what it does.
After some brief reading of the docs, I'm trying to write one-sentence explanations. Maybe this will be helpful to you.
What
Preserves is a specification and set of libraries in popular languages that lets you reliably exchange data between XML, JSON and EDN.
Who
Preserves is built for (data engineers|data framework writers) to reliably interchange data.
Why
Formats like JSON in particular are imprecise. Preserves forces you to deal with these vagaries up front.
What else?
With P-Expressions you can search a Preserves-compliant data source much like you would query JSON with jq.
Who Not? Who shouldn't use this
This will not help a data analyst exchange data between CSV and Excel
- lionkor 5 months ago
  
  Thank you! That makes more sense now
  
  paddy_m 5 months ago
  
  I have no idea if the project author would agree with those sentences, I was just proposing them.
tonyg 5 months ago

Useful if you have a JSON-keyed table, for example: JSON lacks a useful (standardised) equivalence relation, meaning you get weak and/or implementation-specific guarantees about how key lookup works. Equivalences were the motivation for developing Preserves: I was (and am) working on a generalized approach to messaging middleware, you might say, meaning that things like "patterns over values" and "filters" and "value-keyed tables" are all things I need to talk about. (This all comes out of RabbitMQ/AMQP thinking back in the day and my PhD-and-after work subsequently.)
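
As a small illustration of the "JSON-keyed table" problem, the Python sketch below applies three ad hoc notions of equality to the same JSON texts and gets three different answers. It uses only the standard `json` module; `canonical_key` is a hypothetical helper, and nothing here claims to reproduce Preserves' own equivalence.

```python
import json

text_a = '{"id": 1, "tags": ["x", "y"]}'
text_b = '{"tags": ["x", "y"], "id": 1}'      # same members, different key order
text_c = '{"id": 1.0, "tags": ["x", "y"]}'    # 1.0 instead of 1

# Rule 1: compare the raw text. Key order alone makes the keys differ.
raw_a_eq_b = (text_a == text_b)                                  # False

def canonical_key(text: str) -> str:
    """Hypothetical helper: parse, then re-serialize with sorted keys."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

# Rule 2: compare re-serialized text. Key order no longer matters,
# but 1 and 1.0 serialize differently.
canon_a_eq_b = canonical_key(text_a) == canonical_key(text_b)    # True
canon_a_eq_c = canonical_key(text_a) == canonical_key(text_c)    # False

# Rule 3: compare the parsed Python values. Here 1 == 1.0, so a and c match.
parsed_a_eq_c = json.loads(text_a) == json.loads(text_c)         # True
```

Each rule is defensible on its own, which is the point: without a standardised equivalence, which keys collide in a table depends on the implementation.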

layer8 5 months ago

Looking at the headings in that TOC, “Preserves” is a bit of an unfortunate naming choice grammatically.

tonyg 5 months ago

Yeah I struggle with "Preserve" vs "Preserves" sometimes. Was there something in particular that struck you as unfortunate, though?
- layer8 5 months ago
  
  "Preserves <something>", for example "Preserves data", reads like "it preserves data". Probably less so in the middle of a sentence, due to the uppercasing, but in the TOC it reads like bullet points enumerating what is preserved.
  
  tonyg 5 months ago
  
  Thank you! I wonder if something a bit contrived such as small-caps could help. I'll experiment.
  
  layer8 5 months ago
  
  Putting Preserves in italics would be an alternative.
  The name nevertheless feels awkward to me, also in spoken conversation. A made-up word like maybe “Pres” or “Edal” (from “expressive data language”) would work better IMO.

conartist6 5 months ago

CSTML is targeting many of the same weaknesses in JSON. It's fun to see a whole different, competing set of design choices at work. I had a very different take on schema validation and how to use the < syntax.

tonyg 5 months ago

Do you have a link for CSTML, please? Googling is showing a bunch of possibilities none of which look quite relevant enough to be right...
- conartist6 5 months ago
  
  https://github.com/bablr-lang/
  https://bablr.org/playground
  
  tonyg 5 months ago
  
  Thank you!

ceving 5 months ago

I cannot find an Emacs mode. Does anybody know of one?

tonyg 5 months ago

https://gitlab.com/preserves/preserves/-/blob/main/preserves... -- crude but effective! (I use it all the time)

account-5 5 months ago

Might be an ignorant question but why not just use XML? It seems like XML could do all this, from my limited reading?

mhalle 5 months ago

XML would require a schema to express the concepts in Preserves, or JSON for that matter.
The reason JSON is lower friction than XML for data representation is that you get basic data representations (numbers, strings, arrays, maps) for free in a natural native syntax that happens to parallel multiple programming languages.
XML, in contrast, is a meta-language that allows schema to express different data representations. You've got to use attributes and elements to represent data and data types. XSD is a common datatype schema, but it's quite verbose, and data serialization looks very different from what it looks like in a programming language representation.
Preserves looks like a superset of JSON. It includes additional data representation concepts through syntax extensions, but the idea is the same.
What I don't see is a standard way to map record types (like "irl" in the tutorial) to a unique identifier like a URI/IRI, or something like a CURIE. That kind of feature would allow Preserves to better describe standardized record types.
tonyg 5 months ago

XML would make a fine choice. It lacks atomic data types other than text, and compound data types other than sequences, unless you count element attributes, which are in a kind of awkward position because of the historical development of the language. Preserves has a richer suite of primitive data types and decomposes XML's elements into separate notions of map, sequence, and tagged value.

mcphage 5 months ago

It seems like it doesn't natively support decimals (needed for financial data) or any sort of date or datetime?

tonyg 5 months ago

Date and datetime are left to convention: https://preserves.dev/conventions.html#dates-and-times
Decimals I'm on the fence about. Some discussion here: https://gitlab.com/preserves/preserves/-/issues/10

twism 5 months ago

So EDN?

tonyg 5 months ago

Yeah EDN is quite similar. Preserves has no nil, allows any value as a tag, gets into the weeds more on when strings are equal or not, doesn't distinguish lists and vectors, and doesn't require each kind of tagged element to define an equivalence. And it has annotations (vs EDN's comments) and embedded values.