Flattening JSON using RTTI
The JSON data format is everywhere. JavaScript Object Notation is
used to serialize (and deserialize) data structures; converting
binary data in memory to text, and vice versa. JSON is a pretty cool
format to work with, but it’s not great on the command-line or in
shell scripts. The utility jq comes to the rescue, but alternatively
I just wanted to have a “flat” key-value pair file that you can awk
and grep on. So why not flatten the JSON and do just that?
JSON is easy for dynamically typed scripting languages such as JavaScript and Python, but things get a little more difficult when the language has strict typing.
Golang has neat builtin support for loading JSON data using struct tagging. It’s quite a unique feature; here is an example of what that looks like:
type Album struct {
    Artist string `json:"artist"`
    Title  string `json:"title"`
    Year   int    `json:"year,omitempty"`
}
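As a minimal sketch of how these tags are used, the following round-trips an album through encoding/json. The helper names parseAlbum and encodeAlbum are made up for this illustration. Note that omitempty only matters when encoding, where a zero Year is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

type Album struct {
	Artist string `json:"artist"`
	Title  string `json:"title"`
	Year   int    `json:"year,omitempty"`
}

// parseAlbum (hypothetical helper) decodes a single JSON object into an Album.
func parseAlbum(raw string) (Album, error) {
	var a Album
	err := json.Unmarshal([]byte(raw), &a)
	return a, err
}

// encodeAlbum (hypothetical helper) encodes an Album back to JSON text.
func encodeAlbum(a Album) (string, error) {
	out, err := json.Marshal(a)
	return string(out), err
}

func main() {
	a, err := parseAlbum(`{"artist": "Iron Maiden", "title": "Powerslave", "year": 1984}`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(a.Artist, a.Title, a.Year)

	// Year is zero here, so omitempty drops the "year" key from the output.
	s, err := encodeAlbum(Album{Artist: "Iron Maiden", Title: "Killers"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s)
}
```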
JSON data often has a nested structure, in which case you should
also use nested structs. But that’s all assuming you know the
data format beforehand. If you wish to load arbitrary JSON data,
you’re going to have to do some magic tricks.
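For the case where the format is known up front, nested structs could look roughly like this. The types Record and Discography and the helper parseDiscography are assumptions for this sketch, chosen to match an artist with a list of albums.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Record is a hypothetical type for one entry in the nested "albums" array.
type Record struct {
	Title string `json:"title"`
	Year  int    `json:"year"`
}

// Discography is a hypothetical top-level type; the nested JSON array
// maps onto a slice of structs.
type Discography struct {
	Artist string   `json:"artist"`
	Albums []Record `json:"albums"`
}

// parseDiscography (hypothetical helper) decodes the nested document.
func parseDiscography(raw string) (Discography, error) {
	var d Discography
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

func main() {
	raw := `{"artist": "Iron Maiden", "albums": [{"title": "Powerslave", "year": 1984}]}`
	d, err := parseDiscography(raw)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range d.Albums {
		fmt.Printf("%s: %s (%d)\n", d.Artist, r.Title, r.Year)
	}
}
```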
Strict typing
JSON keys are strings. A JSON value can be a string, number, boolean, or null; it can also be an array, or an object, which is a mapping of key-value pairs whose values can again be any of these types.
JavaScript is dynamically typed, and as a consequence JSON arrays and objects can hold mixed types. You would expect strictly typed languages to struggle with this. In the end it’s not that big of a problem, as we shall see, but let me illustrate that it can be a hassle depending on how you approach the problem.
Go has type constraints, so you might want to write:
type JSON interface {
    bool | float64 | string | []JSON | map[string]JSON
}
This doesn’t work, however: a type constraint cannot refer to itself recursively; you can’t make a “generic generic”.
So then you might try writing:
type JSON interface {
    bool | float64 | string | []any | map[string]any
}
Only to find that this creates more problems down the line.
As soon as type any pops up in a constraint, the best you can do is
simplify and call everything any, rather than trying to fit it
to a constraint.
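To illustrate why a constraint doesn’t help much here, consider a non-recursive constraint over the scalar types only. The names Scalar and formatScalar are made up for this sketch. It compiles and works, but only when the concrete type is known at compile time, which is not the case for values that come out of the JSON decoder.

```go
package main

import "fmt"

// Scalar is a hypothetical, non-recursive constraint covering the JSON leaf types.
type Scalar interface {
	bool | float64 | string
}

// formatScalar works when the concrete type is known at compile time.
func formatScalar[T Scalar](key string, v T) string {
	return fmt.Sprintf("%s %v", key, v)
}

func main() {
	fmt.Println(formatScalar("year", 1984.0))
	fmt.Println(formatScalar("live", true))

	// A value that came out of json.Unmarshal has static type any, and any
	// does not satisfy Scalar, so these lines would not compile:
	//     var v any = 1984.0
	//     formatScalar("year", v)
}
```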
That said, we can make type aliases, but only to prettify the code a little.
type JSON = any
type JSONBool = bool
type JSONNumber = float64
type JSONString = string
type JSONArray = []JSON
type JSONObject = map[string]JSON
The type aliases don’t do anything substantial (it’s just search/replace in the source code), and I don’t use them in the code shown below.
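A small sketch to show that the aliases really are the same types under another name: a function written against map[string]any accepts a JSONObject without conversion, and a type switch on the aliases matches plain []any and map[string]any values. The functions countKeys and kindOf are hypothetical.

```go
package main

import "fmt"

type JSON = any
type JSONArray = []JSON
type JSONObject = map[string]JSON

// countKeys takes the plain map type, with no mention of the aliases.
func countKeys(m map[string]any) int {
	return len(m)
}

// kindOf uses the aliases as type-switch cases; they match the plain types.
func kindOf(v any) string {
	switch v.(type) {
	case JSONArray:
		return "array"
	case JSONObject:
		return "object"
	default:
		return "other"
	}
}

func main() {
	obj := JSONObject{"artist": "Iron Maiden"}
	fmt.Println(countKeys(obj))      // accepted without any conversion
	fmt.Println(kindOf([]any{1, 2})) // a plain []any matches the JSONArray case
}
```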
Loading type any
We can load arbitrary JSON data (from stdin) into a map with value-type
any.
First we load the raw data, next unmarshal that data into the
map structure. Unmarshalling is a more dignified term for deserializing.
// load data
fd := bufio.NewReaderSize(os.Stdin, 64*1024)
data, err := io.ReadAll(fd)
if err != nil {
    log.Fatal(err)
}
/*
    `data_map` holds data by key.
    The values may be of any (JSON) type.
*/
var data_map map[string]any
err = json.Unmarshal(data, &data_map)
if err != nil {
    log.Fatal(err)
}
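Wrapped into functions, the loading step might look like the sketch below. The names decodeObject and loadObject are assumptions, and a string reader stands in for os.Stdin so that it runs without input. It also shows a limitation of this approach: a document whose top-level value is an array does not fit into a map, and Unmarshal reports an error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"strings"
)

// decodeObject (hypothetical helper) unmarshals JSON text whose top-level
// value is an object.
func decodeObject(data []byte) (map[string]any, error) {
	var m map[string]any
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return m, nil
}

// loadObject (hypothetical helper) reads everything from r and decodes it.
func loadObject(r io.Reader) (map[string]any, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return decodeObject(data)
}

func main() {
	m, err := loadObject(strings.NewReader(`{"artist": "Iron Maiden", "year": 1984}`))
	if err != nil {
		log.Fatal(err)
	}
	// Print the dynamic types that the decoder chose for each value.
	fmt.Printf("%T %T\n", m["artist"], m["year"])

	// A top-level array cannot be stored in a map, so this reports an error.
	if _, err := loadObject(strings.NewReader(`[1, 2, 3]`)); err != nil {
		fmt.Println("error:", err)
	}
}
```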
Printing type any
To print the map we just loaded, visit every key and print its
corresponding value. But that value is of type any, and it can even
be an array, or it can be another (sub)map. So we have some recursion
going on.
func PrintFlatJSON(prefix string, data map[string]any) {
    if prefix != "" {
        prefix += "."
    }
    for k, v := range data {
        key := prefix + k
        printKeyValue(key, v)
    }
}
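One thing to be aware of: Go does not specify the iteration order of a map, so the loop above may visit keys in a different order on each run. If stable output matters (for diffing, say), the keys can be sorted first. A minimal sketch, where sortedKeys is a hypothetical helper:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys (hypothetical helper) returns the keys of data in ascending order.
func sortedKeys(data map[string]any) []string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	data := map[string]any{"title": "Powerslave", "artist": "Iron Maiden", "year": 1984.0}
	// Ranging over the sorted keys gives the same order on every run.
	for _, k := range sortedKeys(data) {
		fmt.Println(k, data[k])
	}
}
```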
Now, we wish to print the value of something that is of type any.
In order to print that, we must inspect the type using RTTI
(Run-Time Type Information). Go has a quirky syntax for RTTI, the type
switch; with some imagination, it looks like an inverse typecast.
func printKeyValue(key string, value any) {
    switch v := value.(type) {
    case bool:
        fmt.Printf("%s %t\n", key, v)
    case float64:
        fmt.Printf("%s %f\n", key, v)
    case string:
        fmt.Printf("%s %q\n", key, v)
(This may also be written as a multi-case switch statement that uses
the "%v" format string.)
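A sketch of that multi-case variant, with formatValue as a hypothetical name. Two differences from the separate cases are worth knowing: inside a case that lists several types, v keeps the static type any, and "%v" prints strings without quotes and floats in their shortest form.

```go
package main

import "fmt"

// formatValue (hypothetical helper) handles all scalar types in one case.
func formatValue(key string, value any) string {
	switch v := value.(type) {
	case bool, float64, string:
		// With several types in one case, v keeps the static type any,
		// so %v picks the formatting based on the dynamic type.
		return fmt.Sprintf("%s %v", key, v)
	default:
		return fmt.Sprintf("%s (unhandled %T)", key, v)
	}
}

func main() {
	fmt.Println(formatValue("live", true))
	fmt.Println(formatValue("year", 1984.0))
	fmt.Println(formatValue("title", "Powerslave"))
}
```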
You may have noticed the lack of integers, and indeed we don’t have
a case int here. On the surface it looks like JSON supports
integers, but JavaScript only has the Number type, which is a float
internally. So what happens when we unmarshal JSON data into a value
of type any in Go? We get a float64, not an integer (!)
If you insist on integers, then this funky hack will do the trick:
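// Only reliable while v fits in an int; converting an
// out-of-range float to int is implementation-dependent.
if v-float64(int(v)) == 0 {
    fmt.Printf("%s %d\n", key, int(v))
} else {
    fmt.Printf("%s %f\n", key, v)
}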
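An alternative to the float comparison, not used in the program described here, is to ask the decoder to keep numbers as text. The json.Decoder type has a UseNumber method; with it, numbers arrive as json.Number, which can be converted on demand. The sketch below assumes a hypothetical helper named formatNumberField.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// formatNumberField (hypothetical helper) decodes raw with UseNumber and
// formats the number stored under key.
func formatNumberField(raw, key string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // numbers arrive as json.Number instead of float64

	var m map[string]any
	if err := dec.Decode(&m); err != nil {
		return "", err
	}
	n, ok := m[key].(json.Number)
	if !ok {
		return "", fmt.Errorf("%s is not a number", key)
	}
	// Int64 fails for literals with a fraction or exponent; in that case
	// fall back to the number exactly as it was written in the input.
	if i, err := n.Int64(); err == nil {
		return fmt.Sprintf("%s %d", key, i), nil
	}
	return fmt.Sprintf("%s %s", key, n.String()), nil
}

func main() {
	for _, key := range []string{"year", "price"} {
		line, err := formatNumberField(`{"year": 1984, "price": 9.5}`, key)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(line)
	}
}
```

With UseNumber in effect, the type switch would need a case for json.Number in place of float64.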
A JSON value may be null. The equivalent in Go is nil, but strangely
enough nil is not a Go type. However, we may still use nil as a case
in the type switch:
case nil:
    fmt.Printf("%s null\n", key)
For printing JSON arrays, iterate over the array and recurse:
case []any:
    printArray(key, v)
...
func printArray(prefix string, data []any) {
    for i, elem := range data {
        key := fmt.Sprintf("%s[%d]", prefix, i)
        printKeyValue(key, elem)
    }
}
Likewise, for printing objects, iterate over the map and recurse. Printing a JSON object was actually our initial top-level function, so:
case map[string]any:
    PrintFlatJSON(key, v)
Finally, there is a default case that should never happen:
default:
    panic("unexpected type in unmarshalled JSON data")
Quod erat demonstrandum
Fitting all this together, we can finally demonstrate the outcome. Let’s say we have a nested JSON structure such as this one:
{
    "artist": "Iron Maiden",
    "albums": [
        {
            "title": "Powerslave",
            "year": 1984
        },
        {
            "title": "Somewhere In Time",
            "year": 1986
        },
        {
            "title": "Seventh Son of a Seventh Son",
            "year": 1988
        }
    ]
}
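The little Go program flattens this to the following (the order of keys within an object may differ between runs, because Go does not guarantee map iteration order):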
artist "Iron Maiden"
albums[0].title "Powerslave"
albums[0].year 1984
albums[1].title "Somewhere In Time"
albums[1].year 1986
albums[2].title "Seventh Son of a Seventh Son"
albums[2].year 1988
If that looks a bit underwhelming, realize that it requires run-time type information to make it happen. You can do the same thing in C or C++, but there you typically have to emulate the run-time type information yourself, for example with a hand-coded tagged union.
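For reference, here is one way the pieces could be assembled into a single runnable program. This is a sketch, not the original source: it collects lines in a slice instead of printing directly, sorts the keys so the output is stable, and uses math.Trunc for the integer check. The function names flattenJSON, flattenObject and flattenValue are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"math"
	"sort"
)

// flattenObject appends one "key value" line per leaf found in data.
func flattenObject(prefix string, data map[string]any, out []string) []string {
	if prefix != "" {
		prefix += "."
	}
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output; map iteration order is unspecified
	for _, k := range keys {
		out = flattenValue(prefix+k, data[k], out)
	}
	return out
}

// flattenValue inspects the dynamic type of value and recurses where needed.
func flattenValue(key string, value any, out []string) []string {
	switch v := value.(type) {
	case nil:
		return append(out, key+" null")
	case bool:
		return append(out, fmt.Sprintf("%s %t", key, v))
	case float64:
		// Print whole numbers as integers while they are exactly representable.
		if v == math.Trunc(v) && math.Abs(v) < 1<<53 {
			return append(out, fmt.Sprintf("%s %d", key, int64(v)))
		}
		return append(out, fmt.Sprintf("%s %f", key, v))
	case string:
		return append(out, fmt.Sprintf("%s %q", key, v))
	case []any:
		for i, elem := range v {
			out = flattenValue(fmt.Sprintf("%s[%d]", key, i), elem, out)
		}
		return out
	case map[string]any:
		return flattenObject(key, v, out)
	default:
		panic("unexpected type in unmarshalled JSON data")
	}
}

// flattenJSON decodes a JSON object and returns its flattened lines.
func flattenJSON(raw string) ([]string, error) {
	var data map[string]any
	if err := json.Unmarshal([]byte(raw), &data); err != nil {
		return nil, err
	}
	return flattenObject("", data, nil), nil
}

func main() {
	raw := `{"artist": "Iron Maiden", "albums": [{"title": "Powerslave", "year": 1984}]}`
	lines, err := flattenJSON(raw)
	if err != nil {
		log.Fatal(err)
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

Because the keys are sorted, albums comes before artist in this variant.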