I've been reading a couple of books recently centred around data, data storage, distributed systems and scaling applications. In particular:

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

The book starts by running through some of the challenges of data storage, databases, and the different approaches to databases. It's been an ambition of mine to understand a bit more about data storage and databases in general, so I'm going to attempt to implement some of the ideas in these few chapters over the course of several blog posts, building on the knowledge as I go. This first post will be very brief, and probably not that interesting, but I need a starting point to build on, so bear with me!

I'm going to start by creating a simple key/value API, which takes some raw data, stores it to disk, and allows the user to read the value back from disk again. In future posts, I'll iterate through various improvements and optimisations, such as indexes using B-Trees, write-ahead logs, hash tables, compression/formats, networking, etc.

Getting started

Let's start with a simple test case, and a couple of benchmarks as well, so we can measure our optimisations:

package main

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func roundTrip() (interface{}, error) {
	db := NewDatabase(".data")
	if err := db.Write("key", "value"); err != nil {
		return nil, err
	}

	return db.Get("key")
}

func TestCanStoreValue(t *testing.T) {
	result, err := roundTrip()
	require.NoError(t, err)
	require.Equal(t, "value", result)
}

func benchmarkRoundtrip(num int, b *testing.B) {
	for i := 0; i < num; i++ {
		_, _ = roundTrip()
	}
}

func BenchmarkRoundtrip1(b *testing.B) {
	num := 1
	for i := 0; i < b.N; i++ {
		benchmarkRoundtrip(num, b)
	}
}

func BenchmarkRoundtrip100(b *testing.B) {
	num := 100
	for i := 0; i < b.N; i++ {
		benchmarkRoundtrip(num, b)
	}
}

Now for the code

Okay, cool! You'll notice we've defined a bit of an API in our tests. We have a NewDatabase constructor, which takes a 'path'. The path is the location where the data is persisted on disk. Then, of course, we have a Write method and a Get method. Pretty straightforward: we write a value to disk with a given key, then we can use that key to fetch the value back again.


package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/ioutil"
	"log"
	"os"
)

// NewDatabase returns a Database rooted at the given path, which is where
// values are persisted on disk.
func NewDatabase(path string) *Database {
	// Create the data directory; if it already exists, that's fine.
	// Any other error is unexpected, so panic.
	if err := os.Mkdir(path, 0755); err != nil {
		if !errors.Is(err, os.ErrExist) {
			log.Panic(err)
		}
	}
	return &Database{path}
}

// Database is a simple file-backed key/value store.
type Database struct {
	path string
}

// getKey returns the on-disk file path for a given key.
func (d *Database) getKey(key string) string {
	return fmt.Sprintf("%s/%s", d.path, key)
}

// Write encodes the value as JSON and writes it to a file named after the key.
func (d *Database) Write(key string, value interface{}) error {
	encoded, err := json.Marshal(value)
	if err != nil {
		return err
	}

	if err := ioutil.WriteFile(d.getKey(key), encoded, 0666); err != nil {
		return err
	}

	return nil
}

// Get reads the file for the given key and decodes the JSON back into a value.
func (d *Database) Get(key string) (interface{}, error) {
	data, err := ioutil.ReadFile(d.getKey(key))
	if err != nil {
		return nil, err
	}

	var r interface{}
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}

	return r, nil
}

func main() {
	db := NewDatabase(".data")

	if err := db.Write("key", "value"); err != nil {
		log.Panic(err)
	}

	val, err := db.Get("key")
	if err != nil {
		log.Panic(err)
	}

	log.Println(val)
}

This approach is incredibly crude; we're encoding values as JSON and writing them to disk with the key as the file name, then using the key to fetch the contents back again and load them into memory.

Conceptually, this is how a database works: it takes values, writes them to disk, and provides a way of loading the data back again.

Why this isn't a valid database, yet

This approach is missing several key components and has some glaringly obvious flaws. First of all, it only works locally; databases need a way to run on a network, to take queries, send data over the wire and be accessible to other networked services. Secondly, we're encoding data as JSON and storing it in a string format. Whilst this is incredibly simple, JSON/string data carries a lot of formatting and encoding overhead. In other words, at scale, this approach would use vast amounts of space.
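
To make that first point a bit more concrete, here's a rough, purely illustrative sketch of what exposing this store over a network could look like, building on the Database type above. The serve function, the /keys/ route and the port are all names I've made up for the sake of the example; they're not part of this post's implementation, and real engines tend to use purpose-built wire protocols rather than HTTP and JSON.

// A hypothetical sketch only: exposes the Database from above over HTTP.
// "net/http" would need adding to the existing import block.
func serve(db *Database) error {
	http.HandleFunc("/keys/", func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Path[len("/keys/"):]

		switch r.Method {
		case http.MethodPut:
			// Decode the JSON request body and store it under the key.
			var value interface{}
			if err := json.NewDecoder(r.Body).Decode(&value); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			if err := db.Write(key, value); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.WriteHeader(http.StatusNoContent)
		case http.MethodGet:
			// Read the value back and return it as JSON.
			value, err := db.Get(key)
			if err != nil {
				http.Error(w, err.Error(), http.StatusNotFound)
				return
			}
			_ = json.NewEncoder(w).Encode(value)
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	})

	return http.ListenAndServe(":8080", nil)
}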

Most database engines use some form of compression or a raw binary encoding format; think MongoDB's BSON, for example. This set-up also isn't thread safe: we'd need to implement a lock to prevent concurrent access to our database files.
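
To illustrate that last point, one way to add a lock would be to embed a sync.RWMutex in the Database struct and take it inside Write and Get. This is only a sketch of the changed pieces (the mu field is my own addition, and "sync" would need adding to the import block), not something the code above actually does.

// Database now carries a mutex guarding access to the files on disk.
type Database struct {
	path string
	mu   sync.RWMutex
}

// Write takes an exclusive lock while the file is written.
func (d *Database) Write(key string, value interface{}) error {
	d.mu.Lock()
	defer d.mu.Unlock()

	encoded, err := json.Marshal(value)
	if err != nil {
		return err
	}
	return ioutil.WriteFile(d.getKey(key), encoded, 0666)
}

// Get takes a shared lock, so concurrent readers don't block each other.
func (d *Database) Get(key string) (interface{}, error) {
	d.mu.RLock()
	defer d.mu.RUnlock()

	data, err := ioutil.ReadFile(d.getKey(key))
	if err != nil {
		return nil, err
	}

	var r interface{}
	err = json.Unmarshal(data, &r)
	return r, err
}

Note that this would only guard access within a single process; multiple processes touching the same files would need something stronger, such as file locks.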

We're also storing single values in single files, which won't scale. If you have a datastore containing millions of rows, that's millions of files and inodes. Typically, database engines have a way of compressing and concatenating values into a more efficient format on disk, which I'll attempt to unpack (literally) in a future post.
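
To give a flavour of what concatenating values might look like, here's a deliberately naive, illustrative sketch (appendRecord and findRecord are names I've made up for this example, not part of this post's implementation): every write is appended as one JSON line to a single file, and reads scan that file for the most recent record with a matching key.

// A naive sketch: append each key/value pair as one JSON line to a single log
// file instead of one file per key. "bufio" would need adding to the imports.
func appendRecord(path, key string, value interface{}) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0666)
	if err != nil {
		return err
	}
	defer f.Close()

	line, err := json.Marshal(map[string]interface{}{"key": key, "value": value})
	if err != nil {
		return err
	}

	_, err = f.Write(append(line, '\n'))
	return err
}

// findRecord scans the whole file and keeps the last record with a matching
// key, which in an append-only log is the current value. This linear scan is
// exactly the problem indexes (B-Trees, hash tables) exist to solve.
func findRecord(path, key string) (interface{}, bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, false, err
	}
	defer f.Close()

	var value interface{}
	found := false

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var record map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
			return nil, false, err
		}
		if record["key"] == key {
			value = record["value"]
			found = true
		}
	}
	return value, found, scanner.Err()
}

Real engines go much further than this (binary encodings, segment files, compaction, indexes), but even a naive version like this avoids the one-file-per-key problem.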

Summary

So there we have it: a crude, definitely-not-for-production data store. In the next post, we'll attempt to create a binary format on disk and replace the JSON files.

First iteration benchmark results

goos: darwin
goarch: amd64
BenchmarkRoundtrip1-12              7286            162186 ns/op
BenchmarkRoundtrip100-12              80          13959178 ns/op
PASS
ok      _/Users/ewanvalentine/development/database/basic-map    3.527s

Sponsor me on Patreon to support more content like this.