Automating data transformation

harder than it sounds


Mateusz Kubuszok

About me

Agenda

  • What transformations are we talking about? Where are they used?

  • How can they be generated?

  • What do they have to do with certain mathematical concepts?

  • If we can use math to define the rules of transformation, does that make them easy to implement?

What transformations are we talking about?

API → Domain → DB

// Our logic is defined with these:
case class UserId(value: Long) extends AnyVal
case class UserData(name: String, age: Int)
case class User(id: UserId, data: UserData, updatedAt: Instant)
// API is defined with these:
case class UserApi(name: String, age: Int)

def parseUser(json: String): Either[String, UserApi] = ...
// Database persistence is defined with these:
case class UserDB(id: Long, name: String, age: Int, updatedAt: Instant)

def persistUser(user: UserDB): Either[String, Unit] = ...
// Update endpoint logic of some POST /api/users/{userId}
def updateUser(userId: Long, userBody: String): Either[String, Unit] =
  parseUser(userBody)
    .map { (api: UserApi) =>
      ??? : User  // 1. Use userId and userApi to create a User
    }
    .map { (user: User) =>
      user // 2. Maybe some domain operations?
    }
    .map { (user: User) =>
      ??? : UserDB // 3. Use user to create a UserDB
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }
def updateUser(userId: Long, body: String): Either[String, Unit] =
  parseUser(body)
    .map { (api: UserApi) =>
      // UserApi -> User
      User(
        UserId(userId), UserData(api.name, api.age), Instant.now())
    }
    .map { (user: User) =>
      user
    }
    .map { (user: User) =>
      // User -> UserDB
      UserDB(
        user.id.value, user.data.name, user.data.age, user.updatedAt)
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }
  • What if the number of fields grows?

  • What if they all had similar, primitive types (e.g. String, Int, Boolean)?

  • What if we reorder the fields?

case class UserData(
  firstName: String,
  lastName: String,
  age: Int,
  email: String,
  phone: String,
  address: String,
  city: String
)
(api: UserApi) =>

 UserData(
  api.lastName, api.firstName, api.age, //<-- lastName & firstName reordered
  api.phone, api.email, api.address, api.city //<-- phone & email reordered
 )

Maybe code-gen?

  • If the source value has a field with the same name as a field in the target type, it will be used by default

  • If a source value can be resolved for every field of the target type, the transformation can be generated

  • Anything out of the ordinary has to be explicitly enabled by the programmer in the code (see the sketch below)
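A minimal sketch of these rules in action, using hypothetical CreateUser/UserRow types and a code-gen DSL of the kind shown in the next example (Chimney's, with its import included for completeness):

import io.scalaland.chimney.dsl._

case class CreateUser(name: String, age: Int)
case class UserRow(name: String, age: Int, note: String)

// name and age match by name, so they are wired up automatically;
// note has no source, so the line below would NOT compile:
// CreateUser("Jane", 21).transformInto[UserRow]

// the "extraordinary" part has to be spelled out explicitly:
CreateUser("Jane", 21).into[UserRow]
  .withFieldConst(_.note, "imported")
  .transform // UserRow("Jane", 21, "imported")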

import io.scalaland.chimney.dsl._ // <-- some code-gen library

def updateUser(userId: Long, body: String): Either[String, Unit] =
  parseUser(body)
    .map { (api: UserApi) =>//  UserApi -> UserData is generated
                           // ↓    with default configuration
      User(UserId(userId), api.into[UserData].transform, Instant.now())
    }
    .map { (user: User) =>
      user
    }
    .map { (user: User) =>
         //       Fields from user: User
        //   AND field from user.data: UserData
       // ↓     are used to generate UserDB
      user.into[UserDB].withFallback(user.data).transform
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }

Declarative programming

case class RowFoo(a: Int, b: Int, c: Int, d: Int, e: Int)

case class RowBar(a: Int, b: Int, c: Int, d: Int, e: Int,
                  g: String, h: String)
val foo: RowFoo = ...
val bar = RowBar( // Verbose, repetitive, "imperative",
  a = foo.a,              // a lot of "noise"...
  b = foo.b,
  c = foo.c,
  d = foo.d,
  e = 42,                 // overridden, even though foo.e could be used
  g = "hardcoded",
  h = ((foo.a + foo.b) * foo.c).toString
)
val foo: RowFoo = ...
val bar = foo.into[RowBar] // "Give me a transformation from Foo to Bar
  .withFieldConst(_.e, 42)           // WHERE e is mapped to [provided value]
  .withFieldConst(_.g, "hardcoded")  // AND g is mapped to [provided value]
  .withFieldComputed(_.h, f => ((f.a + f.b) * f.c).toString) // AND h is ...
  .transform // AND for everything else, use a field with the corresponding name"

How can such transformations be generated?

  • Scala - macros, metaprogramming (Chimney, Ducktape, Henkan)

  • Java - annotation processors, runtime reflection, bytecode generation (MapStruct, Dozer, ModelMapper, JMapper, …​)

  • C# - Reflection.Emit, expression trees (AutoMapper, ExpressMapper, …)

  • Go - "struct to struct copying" Copier

  • Swift - runtime reflection (Converter)

But a moment ago we mentioned something about declarative programming.

Let’s digress and talk about SQL.

Relational algebra

  • "second half of the 19th century" - calculus of relations created by A. De Morgan, C.S. Peirce, E. Schröder

  • 1940 - A. Tarski proposes the first axiomatization

  • 1970 - "A Relational Model of Data for Large Shared Data Banks", E.F. Codd

  • 1970s - SQL, Donald D. Chamberlin, Raymond F. Boyce

Records

\[S = \{(s_{j1}, s_{j2}, ..., s_{jn})|\ j = 1,...,m\}\]
  • m - number of "rows"

  • n - number of "columns"

\[h = (h_1, h_2, ..., h_n)\]
\[h_i((s_{j1}, s_{j2}, ..., s_{jn})) = s_{ji}\]
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[h_U = (id, name, age)\]
\[u = (1, John, 20) \in U\]
\[id(u) = 1\]
\[name(u) = John\]
\[age(u) = 20\]

Selection

\[\phi - \text{a predicate}\]
\[\sigma_{\phi}(S) = \{s \in S|\ \phi(s)\}\]
SELECT *
FROM S s
WHERE phi(s) -- <-- this is selection
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[\phi(u) = age(u) > 20\]
\[\sigma_{\phi}(U) = \{(2, Jane, 21), (3, Jim, 22)\}\]
SELECT *
FROM U u
WHERE u.age > 20

Projection

\[\{a_1,...,a_m\} \subseteq \{h_1, ..., h_n\}\]
\[\Pi_{a_1,...,a_m}(S) = \{(a_1(s), ..., a_m(s))|s \in S\}\]
SELECT s.a1, s.a2, ..., s.am -- <-- this is projection
FROM S s
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[\Pi_{age,name}(U) = \{(20, John), (21, Jane), (22, Jim)\}\]
SELECT u.age, u.name
FROM U u

Rename

\[h_{\rho_{a/b}(S)} = (\{ h_1, ..., h_n\} \setminus \{a\}) \cup \{b\}\]

(\(\rho_{a/b}\) renames attribute \(a\) to \(b\): the tuples keep their values, only the header changes.)
SELECT s.a1, s.a2 AS new_name, ..., s.an -- <-- this is rename
FROM S s
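A worked example on the same U relation (an illustration added for symmetry with the earlier operators):

\[h_U = (id, name, age)\]
\[h_{\rho_{name/full\_name}(U)} = (id, full\_name, age)\]
SELECT u.id, u.name AS full_name, u.age
FROM U u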

Cartesian product

Provided that \(h_S\) and \(h_T\) are disjoint

\[S \times T = \{(s, t) | s \in S, t \in T\}\]
\[h_{S \times T} = \{h_{S1}, ..., h_{Sn}\} \cup \{h_{T1}, ..., h_{Tm}\}\]
SELECT *
FROM S s, T t -- or CROSS JOIN
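A small worked example (my illustration; result tuples are written flattened, the way SQL returns them):

\[S = \{(1, John), (2, Jane)\}\]
\[T = \{(20), (21)\}\]
\[S \times T = \{(1, John, 20), (1, John, 21), (2, Jane, 20), (2, Jane, 21)\}\]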

(Natural) Join

\(h_S\) and \(h_T\) may have an overlap:

\[O = \{h_{S1}, ..., h_{Sn}\} \cap \{h_{T1}, ..., h_{Tm}\}\]

Then

\[S \bowtie T = \{(s, t) | s \in S, t \in T, \forall_{o \in O}\ o(s) = o(t)\}\]
If \(O = \emptyset\), the condition is vacuous and the natural join degenerates to the Cartesian product:

\[S \bowtie T = \{(s, t) | s \in S, t \in T\}\]
SELECT *
FROM S s
NATURAL JOIN T t
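A worked example (my illustration), with the shared id attribute written once, the way SQL presents it:

\[h_S = (id, name),\ S = \{(1, John), (2, Jane)\}\]
\[h_T = (id, age),\ T = \{(1, 20), (2, 21), (3, 22)\}\]
\[S \bowtie T = \{(1, John, 20), (2, Jane, 21)\}\]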

There are also:

  • \(\theta\)-join

  • inner join

  • left outer join

  • right outer join

  • full outer join

  • etc.

Union

\[S \cup T = \{x | x \in S \vee x \in T\}\]
(SELECT * FROM S s)
UNION
(SELECT * FROM T t)
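A worked example (my illustration; both relations need the same header, and duplicates collapse as in set union):

\[S = \{(1, John, 20)\}\]
\[T = \{(1, John, 20), (2, Jane, 21)\}\]
\[S \cup T = \{(1, John, 20), (2, Jane, 21)\}\]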

Weren’t we supposed to talk about transformations?

Transformations and relational operations

Projection

SELECT src.e, src.d, src.c, src.b, src.a -- order we want in the result
FROM Source src -- order in Source table is irrelevant
case class Source(a: Int, b: Int, c: Int, d: Int, e: Int)

case class Target(e: Int, d: Int, c: Int, b: Int, a: Int)
val src: Source = ...
val tgt = src.into[Target].transform
       // Target(e = src.e, d = src.d, c = src.c, b = src.b, a = src.a)
SELECT src.e, src.d, src.c, src.b, src.a -- extra fields are ignored
FROM Source src -- has additional columns f and g
case class Source(a: Int, b: Int, c: Int, d: Int, e: Int, f: Int, g: Int)
                                                       // ↑ extra fields
case class Target(e: Int, d: Int, c: Int, b: Int, a: Int)
val src: Source = ...
val tgt = src.into[Target].transform
       // Target(e = src.e, d = src.d, c = src.c, b = src.b, a = src.a)

Renames

SELECT src.a, src.b, src.c AS z -- column c is renamed to z
FROM Source src -- has attribute c
case class Source(a: Int, b: Int, c: Int)

case class Target(a: Int, b: Int, z: Int)
                               // ↑ renamed field
val src: Source = ...
val tgt = src.into[Target]
            .withFieldRenamed(_.c, _.z) // <-- c AS z
            .transform
       // Target(a = src.a, b = src.b, z = src.c)

Cartesian product

SELECT foo.a, foo.b, foo.c, bar.d, bar.e
FROM Foo foo
CROSS JOIN Bar bar -- Cartesian product
case class Foo(a: Int, b: Int, c: Int)

case class Bar(d: Int, e: Int)

case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val bar: Bar = ...
val baz = foo.into[Baz]
            .withFallback(bar) // <-- CROSS JOIN
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = bar.d, e = bar.e)
SELECT foo.a, foo.b, foo.c, provided.d, provided.e
FROM Foo foo
CROSS JOIN (SELECT 1 as d, 2 as e) provided
case class Foo(a: Int, b: Int, c: Int)

case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldConst(_.d, 1)
            .withFieldConst(_.e, 2)
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = 1, e = 2)
SELECT foo.a, foo.b, foo.c, 1 AS d, 2 AS e -- use constants without a table
FROM Foo foo
case class Foo(a: Int, b: Int, c: Int)
                                    // ↓ extra fields
case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldConst(_.d, 1)
            .withFieldConst(_.e, 2)
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = 1, e = 2)
SELECT foo.a, foo.b, foo.c, custom_function(foo.a, foo.b) AS d
FROM Foo foo
case class Foo(a: Int, b: Int, c: Int)
                                    // ↓ extra field
case class Baz(a: Int, b: Int, c: Int, d: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldComputed(_.d, foo => custom_function(foo.a, foo.b))
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = custom_function(foo.a, foo.b))

These were quite easy.

Let’s try something harder.

Enums

enum UserType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
}
val userType: UserType = ...
userType match {
  case UserType.Admin => ...
  case UserType.Moderator(roles) => ...
  case UserType.User(verified) => ...
}
enum UserType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
}
enum UserDBType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
  case Suspended // <-- extra case
}
val userType: UserType = ...
userType match {
  case UserType.Admin => UserDBType.Admin
  case UserType.Moderator(roles) => UserDBType.Moderator(roles)
  case UserType.User(verified) => UserDBType.User(verified)
  // nothing is matched into Suspended and that's ok
}
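The same match-by-name rules let a Chimney-like library derive that pattern match for us; a sketch, assuming the import shown earlier:

val userType: UserType = UserType.Moderator(List("mods"))
// every UserType case has a same-named UserDBType counterpart,
// so the whole match above can be generated:
val dbType: UserDBType = userType.into[UserDBType].transform
// the extra UserDBType.Suspended case is simply never produced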
-- we can use unique attribute names to distinguish which case each row came from
SELECT ut.type, null as roles, null as verified FROM UserType ut WHERE ut.type = 'Admin'
UNION
SELECT ut.type, ut.roles, null as verified FROM UserType ut WHERE ut.type = 'Moderator'
UNION
SELECT ut.type, null as roles, ut.verified FROM UserType ut WHERE ut.type = 'User'

Nested data

case class OuterFoo(inner: InnerFoo)
case class InnerFoo(a: Int, b: Int, c: Int)

case class OuterBar(inner: InnerBar)
case class InnerBar(a: Int, b: Int, c: Int)
val foo: OuterFoo = ...

val baz = foo.into[OuterBar].transform // <-- is this still principled?
OuterBar(
  inner = foo.inner.into[InnerBar].transform
)
// or
OuterBar(
  inner = {
    val inner = foo.inner
    InnerBar(a = inner.a, b = inner.b, c = inner.c)
  }
)
// or
OuterBar(
  inner = InnerBar(a = foo.inner.a, b = foo.inner.b, c = foo.inner.c)
)
SELECT foo.`inner.a` AS `inner.a`,
       foo.`inner.b` AS `inner.b`,
       foo.`inner.c` AS `inner.c`
FROM OuterFoo foo

It looks too easy, so let’s complicate it a bit.

case class OuterFoo(inner: InnerFoo)
case class InnerFoo(a: Int, b: Int)

case class OuterBar(inner: InnerBar)
case class InnerBar(c: Int, d: Int)

case class OuterBaz(inner: InnerBaz)             // ↓ extra field
case class InnerBaz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: OuterFoo = ...
val bar: OuterBar = ...

val baz = foo.into[OuterBaz]
   .withFallback(bar) // <-- CROSS JOIN / Cartesian product
   .withFieldConst(_.inner.e, 1) // <-- extra field
   .transform
SELECT foo.`inner.a` AS `inner.a`,
       foo.`inner.b` AS `inner.b`,
       bar.`inner.c` AS `inner.c`,
       bar.`inner.d` AS `inner.d`,
       1             AS `inner.e`
FROM OuterFoo foo
CROSS JOIN OuterBar bar

Optional values

case class Source(value: Int)

case class Target(value: Option[Int])
val source: Source = ...
val target = source.into[Target].transform
          // Target(value = Some(source.value))
case class Source2(value: Option[Int])

case class Target2(value: Int)
val source2: Source2 = Source2(None) // <-- cannot be unwrapped!
val target2 = ???

Embeddings

\[E_A = A \cup \{Empty,\ ErrorMsg,\ Exception \}\]
// pseudo-code
class Empty
class ErrorMsg
class Exception

type E[A] = A | Empty | ErrorMsg | Exception

val result: E[Target2] = ...

result match {
  case e: Empty => ...
  case e: ErrorMsg => ...
  case e: Exception => ...
  case success: Target2 => ...
}
val source2: Source2 = Source2(None) // <-- cannot be unwrapped!
val result: partial.Result[Target2] = source2.intoPartial[Target2].transform

result match { // <-- actually, implemented as enum (sealed trait)
  case partial.Result.Value(target2) => ... // <-- no runtime failures
  case partial.Result.Errors(errors) => ... // <-- 1 or more errors
}
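For completeness, a partial.Result does not have to be pattern matched; it can be folded into more familiar types (a sketch based on Chimney's documented conversions):

result.asOption                  // Option[Target2]
result.asEither                  // Either[partial.Result.Errors, Target2]
result.asErrorPathMessageStrings // which fields failed and why, as (path, message) pairs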

Different naming conventions

case class Source(someName: String, anotherName: String) // camel case

case class Target(some_name: String, another_name: String) // snake case

Equivalence relation

  1. symmetric - \(a \equiv b \iff b \equiv a\)

  2. reflexive - \(a \equiv a\)

  3. transitive - \(a \equiv b \land b \equiv c \implies a \equiv c\)

If we treat names written in different cases as equivalent, we can merge them into a single equivalence class.

\[\{someName, some\_name, SomeName, ...\}\]
\[\{anotherName, another\_name, AnotherName, ...\}\]
// Define the equivalence relation via a predicate and use it for name matching
case object DifferentCasesAreEqual extends TransformedNamesComparison {

  def namesMatch(fromName: String, toName: String): Boolean = ...
}
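One possible body for namesMatch (my illustration; the talk elides it): strip separators and compare case-insensitively.

  def namesMatch(fromName: String, toName: String): Boolean = {
    // "someName", "some_name" and "SomeName" all normalize to "somename"
    def normalize(name: String): String =
      name.filterNot(c => c == '_' || c == '-').toLowerCase
    normalize(fromName) == normalize(toName)
  }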
case class Source(someName: String, anotherName: String) // camel case

case class Target(some_name: String, another_name: String) // snake case

val source: Source = ...
val target = source.into[Target]
  .enableCustomFieldNameComparison(DifferentCasesAreEqual)
  .transform
// Target(some_name = source.someName, another_name = source.anotherName)

So far:

  • resolving values for fields can follow some reasonable rules expressed declaratively

  • similarly for enums/coproducts

  • and nested data

  • unavoidable runtime errors can be handled in a type-safe way

  • we can provide missing values in a principled way as well

Where is the catch?

Real World

Fields vs normal vals

case class Source(a: Int, b: String, c: Double)

case class Target(a: Int, b: String, c: Double)

val source: Source = ...
val target = source.into[Target].transform
          // Target(a = source.a, b = source.b, c = source.c)
abstract class Parent {

  val a: Int = 1 // a is declared as val, but in parent!
}
case class Source(d: Int) extends Parent {

  def b: String = "value" // b is declared as method, not field!

  val c: Double = 1.0 // c is declared as val, but not in "fields" list!
}

case class Target(a: Int, b: String)

val source: Source = ...
val target = source.into[Target].transform
          // Target(a = source.a, b = source.b) <-- should these accessors be used?
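One library-level answer (Chimney's, for instance) is not to guess but to make such non-constructor accessors opt-in; a sketch using its documented flags:

val target = source.into[Target]
  .enableInheritedAccessors // allow the val a inherited from Parent
  .enableMethodAccessors    // allow def b, which is a method, not a field
  .transform
// Target(a = source.a, b = source.b)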

Flattening

case class User(meta: UserMetadata, data: UserData)
case class UserMetadata(id: Long, createdAt: Instant)
case class UserData(name: String, age: Int)

case class UserDto(id: Long, name: String, age: Int, createdAt: Instant)

val user: User = ...

val userDto = ??? // <-- should we automatically flatten user?
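The flattening can at least be expressed explicitly with the DSL shown so far; whether it should also happen automatically is exactly the open question. A sketch:

val userDto = user.into[UserDto]
  .withFallback(user.meta) // supplies id and createdAt
  .withFallback(user.data) // supplies name and age
  .transform
// UserDto(user.meta.id, user.data.name, user.data.age, user.meta.createdAt)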

Java Beans

case class User(name: String, age: Int)

class UserBean {
  private var name: String = uninitialized
  def getName(): String = name
  def setName(name: String): Unit = this.name = name

  private var age: Int = uninitialized
  def getAge(): Int = age
  def setAge(age: Int): Unit = this.age = age
}
val user: User = ...
// Requires mapping:
//  - setName(user.name)
//  - setAge(user.age)
val userBean: UserBean = user.into[UserBean].transform
val userBean: UserBean = ...
// Requires mapping:
//  - name = userBean.getName()
//  - age = userBean.getAge()
val user: User = userBean.into[User].transform
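In Chimney, for example, bean getters and setters are not picked up unless explicitly enabled, so both directions need a flag; a sketch using the documented DSL:

val bean: UserBean = user.into[UserBean]
  .enableBeanSetters // allow writing via setName/setAge on a default-constructed bean
  .transform

val userBack: User = bean.into[User]
  .enableBeanGetters // allow reading via getName()/getAge()
  .transform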
trait NamedSet {

  // name of the set this value belongs to
  def setName: String = "User"
}
case class User(name: String, age: Int) extends NamedSet {

  // this value belongs to the User set
  override def setName: String = "User"
}

class UserBean {
  private var name: String = uninitialized
  def getName(): String = name
  def setName(name: String): Unit = this.name = name

  private var age: Int = uninitialized
  def getAge(): Int = age
  def setAge(age: Int): Unit = this.age = age
}

AnyVal and Option

case class Source(id: String, data: String) // Newtype-like ↓
case class SourceOptional(source: Option[Source]) extends AnyVal

case class Target(id: String, data: String)

val source: Source  = ...
source.into[Target].transform
// Target(id = source.id, data = source.data)
source.into[Option[Target]].transform
// Some(Target(id = source.id, data = source.data))
Option(source).into[Option[Target]].transform
// Option(source).map(src => Target(id = src.id, data = src.data))
Option(source).intoPartial[Target].transform.asOption
// Some(Target(id = source.id, data = source.data))

Option.empty[Source].intoPartial[Target].transform.asOption
// None
case class Source(id: String, data: String) // Newtype-like ↓
case class SourceOptional(source: Option[Source]) extends AnyVal

case class Target(id: String, data: String)

val source: Source  = ...
// What does it do?
SourceOptional(Some(source)).intoPartial[Option[Target]].transform.asOption
SourceOptional(None).intoPartial[Option[Target]].transform.asOption
// hint: type is Option[Option[Target]]
// Expected:
Some(Some(Target(id = source.id, data = source.data)))
Some(None)
// Actual:
Some(Some(Target(id = source.id, data = source.data)))
None

Summary

  1. when we have schemas and the source and target are similar, we can define a principled way to map between them

  2. but users might have expectations about the types, which might require creativity in defining the rules

  3. in real-world code there will be cases with no single obvious behavior

  4. in real-world code there will be feature-interaction bugs

Thank you!