Automating data transformation

harder than it sounds


Mateusz Kubuszok

About me

Agenda

  • What transformations are we talking about? Where are they used?

  • How can they be generated?

  • What do they have to do with certain mathematical concepts?

  • If we can use math to define the rules of transformation, does that make them easy to implement?

What transformations are we talking about?

API → Domain → DB

// Our logic is defined with these:
case class UserId(value: Long) extends AnyVal
case class UserData(name: String, age: Int)
case class User(id: UserId, data: UserData, updatedAt: Instant)
// API is defined with these:
case class UserApi(name: String, age: Int)

def parseUser(json: String): Either[String, UserApi] = ...
// Database persistence is defined with these:
case class UserDB(id: Long, name: String, age: Int, updatedAt: Instant)

def persistUser(user: UserDB): Either[String, Unit] = ...
// Update endpoint logic of some POST /api/users/{userId}
def updateUser(userId: Long, userBody: String): Either[String, Unit] =
  parseUser(userBody)
    .map { (api: UserApi) =>
      ??? : User  // 1. Use userId and userApi to create a User
    }
    .map { (user: User) =>
      user // 2. Maybe some domain operations?
    }
    .map { (user: User) =>
      ??? : UserDB // 3. Use user to create a UserDB
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }
def updateUser(userId: Long, body: String): Either[String, Unit] =
  parseUser(body)
    .map { (api: UserApi) =>
      // UserApi -> User
      User(
        UserId(userId), UserData(api.name, api.age), Instant.now())
    }
    .map { (user: User) =>
      user
    }
    .map { (user: User) =>
      // User -> UserDB
      UserDB(
        user.id.value, user.data.name, user.data.age, user.updatedAt)
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }
  • What if the number of fields grows?

  • What if they all had similar, primitive types (e.g. String, Int, Boolean)?

  • What if we reorder the fields?

case class UserData(
  firstName: String,
  lastName: String,
  age: Int,
  email: String,
  phone: String,
  address: String,
  city: String
)
(api: UserApi) =>

 UserData(
  api.lastName, api.firstName, api.age, //<-- lastName & firstName reordered
  api.phone, api.email, api.address, api.city //<-- phone & email reordered
 )

Maybe code-gen?

  • If the source value has a field with the same name as a field in the target type, it will be used by default

  • If a source value can be resolved for every field of the target type, the transformation can be generated

  • Anything out of the ordinary has to be explicitly enabled by the programmer in the code (see the sketch below)
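A minimal sketch of these rules in action, using hypothetical CreateUser/UserRow types and a code-gen DSL of the kind shown in the next example (Chimney's, with its import included for completeness):

import io.scalaland.chimney.dsl._

case class CreateUser(name: String, age: Int)
case class UserRow(name: String, age: Int, note: String)

// name and age match by name, so they are wired up automatically;
// note has no source, so the line below would NOT compile:
// CreateUser("Jane", 21).transformInto[UserRow]

// the "extraordinary" part has to be spelled out explicitly:
CreateUser("Jane", 21).into[UserRow]
  .withFieldConst(_.note, "imported")
  .transform // UserRow("Jane", 21, "imported")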

import io.scalaland.chimney.dsl._ // <-- some code-gen library

def updateUser(userId: Long, body: String): Either[String, Unit] =
  parseUser(body)
    .map { (api: UserApi) =>//  UserApi -> UserData is generated
                           // ↓    with default configuration
      User(UserId(userId), api.into[UserData].transform, Instant.now())
    }
    .map { (user: User) =>
      user
    }
    .map { (user: User) =>
         //       Fields from user: User
        //   AND field from user.data: UserData
       // ↓     are used to generate UserDB
      user.into[UserDB].withFallback(user.data).transform
    }
    .flatMap { (userDB: UserDB) => persistUser(userDB) }

Declarative programming

case class RowFoo(a: Int, b: Int, c: Int, d: Int, e: Int)

case class RowBar(a: Int, b: Int, c: Int, d: Int, e: Int,
                  g: String, h: String)
val foo: RowFoo = ...
val bar = RowBar( // Verbose, repetitive, "imperative",
  a = foo.a,              // a lot of "noise"...
  b = foo.b,
  c = foo.c,
  d = foo.d,
  e = 42,                 // overridden, even though foo.e could be used
  g = "hardcoded",
  h = ((foo.a + foo.b) * foo.c).toString
)
val foo: RowFoo = ...
val bar = foo.into[RowBar] // "Give me a transformation from Foo to Bar
  .withFieldConst(_.e, 42)           // WHERE e is mapped to [provided value]
  .withFieldConst(_.g, "hardcoded")  // AND g is mapped to [provided value]
  .withFieldComputed(_.h, f => ((f.a + f.b) * f.c).toString) // AND h is ...
  .transform // AND for everything else, use a field with the corresponding name"

How can such transformations be generated?

  • Scala - macros, metaprogramming (Chimney, Ducktape, Henkan)

  • Java - annotation processors, runtime reflection, bytecode generation (MapStruct, Dozer, ModelMapper, JMapper, …​)

  • C# - Reflection.Emit, expression trees (AutoMapper, ExpressMapper, …)

  • Go - "struct to struct copying" Copier

  • Swift - runtime reflection (Converter)

But a moment ago we mentioned something about declarative programming.

Let’s digress and talk about SQL.

Relational algebra

  • "second half of the 19th century" - calculus of relations created by A. De Morgan, C.S. Peirce, E. Schröder

  • 1940 - A. Tarski proposes the first axiomatization

  • 1970 - "A Relational Model of Data for Large Shared Data Banks", E.F. Codd

  • 1970s - SQL, Donald D. Chamberlin, Raymond F. Boyce

Records

\[S = \{(s_{j1}, s_{j2}, ..., s_{jn})|\ j = 1,...,m\}\]
  • m - number of "rows"

  • n - number of "columns"

\[h = (h_1, h_2, ..., h_n)\]
\[h_i((s_{j1}, s_{j2}, ..., s_{jn})) = s_{ji}\]
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[h_U = (id, name, age)\]
\[u = (1, John, 20) \in U\]
\[id(u) = 1\]
\[name(u) = John\]
\[age(u) = 20\]

Selection

\[\phi - \text{a predicate}\]
\[\sigma_{\phi}(S) = \{s \in S|\ \phi(s)\}\]
SELECT *
FROM S s
WHERE phi(s) -- <-- this is selection
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[\phi(u) = age(u) > 20\]
\[\sigma_{\phi}(U) = \{(2, Jane, 21), (3, Jim, 22)\}\]
SELECT *
FROM U u
WHERE u.age > 20

Projection

\[\{a_1,...,a_m\} \subseteq \{h_1, ..., h_n\}\]
\[\Pi_{a_1,...,a_m}(S) = \{(a_1(s), ..., a_m(s))|s \in S\}\]
SELECT s.a1, s.a2, ..., s.am -- <-- this is projection
FROM S s
\[U = \{(1, John, 20), (2, Jane, 21), (3, Jim, 22)\}\]
\[\Pi_{age,name}(U) = \{(20, John), (21, Jane), (22, Jim)\}\]
SELECT u.age, u.name
FROM U u

Rename

\[h_{\rho_{a/b}(S)} = (\{ h_1, ..., h_n\} \setminus \{a\}) \cup \{b\}\]

(\(\rho_{a/b}\) renames attribute \(a\) to \(b\): the tuples keep their values, only the header changes.)
SELECT s.a1, s.a2 AS new_name, ..., s.an -- <-- this is rename
FROM S s
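A worked example on the same U relation (an illustration added for symmetry with the earlier operators):

\[h_U = (id, name, age)\]
\[h_{\rho_{name/full\_name}(U)} = (id, full\_name, age)\]
SELECT u.id, u.name AS full_name, u.age
FROM U u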

Cartesian product

Provided that \(h_S\) and \(h_T\) are disjoint

\[S \times T = \{(s, t) | s \in S, t \in T\}\]
\[h_{S \times T} = \{h_{S1}, ..., h_{Sn}\} \cup \{h_{T1}, ..., h_{Tm}\}\]
SELECT *
FROM S s, T t -- or CROSS JOIN
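A small worked example (my illustration; result tuples are written flattened, the way SQL returns them):

\[S = \{(1, John), (2, Jane)\}\]
\[T = \{(20), (21)\}\]
\[S \times T = \{(1, John, 20), (1, John, 21), (2, Jane, 20), (2, Jane, 21)\}\]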

(Natural) Join

\(h_S\) and \(h_T\) may have an overlap:

\[O = \{h_{S1}, ..., h_{Sn}\} \cap \{h_{T1}, ..., h_{Tm}\}\]

Then

\[S \bowtie T = \{(s, t) | s \in S, t \in T, \forall_{o \in O}\ o(s) = o(t)\}\]
If \(O = \emptyset\), the condition is vacuous and the natural join degenerates to the Cartesian product:

\[S \bowtie T = \{(s, t) | s \in S, t \in T\}\]
SELECT *
FROM S s
NATURAL JOIN T t
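A worked example (my illustration), with the shared id attribute written once, the way SQL presents it:

\[h_S = (id, name),\ S = \{(1, John), (2, Jane)\}\]
\[h_T = (id, age),\ T = \{(1, 20), (2, 21), (3, 22)\}\]
\[S \bowtie T = \{(1, John, 20), (2, Jane, 21)\}\]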

There are also:

  • \(\theta\)-join

  • inner join

  • left outer join

  • right outer join

  • full outer join

  • etc.

Union

\[S \cup T = \{x | x \in S \vee x \in T\}\]
(SELECT * FROM S s)
UNION
(SELECT * FROM T t)
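A worked example (my illustration; both relations need the same header, and duplicates collapse as in set union):

\[S = \{(1, John, 20)\}\]
\[T = \{(1, John, 20), (2, Jane, 21)\}\]
\[S \cup T = \{(1, John, 20), (2, Jane, 21)\}\]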

Weren’t we supposed to talk about transformations?

Transformations and relational operations

Projection

SELECT src.e, src.d, src.c, src.b, src.a -- order we want in the result
FROM Source src -- order in Source table is irrelevant
case class Source(a: Int, b: Int, c: Int, d: Int, e: Int)

case class Target(e: Int, d: Int, c: Int, b: Int, a: Int)
val src: Source = ...
val tgt = src.into[Target].transform
       // Target(e = src.e, d = src.d, c = src.c, b = src.b, a = src.a)
SELECT src.e, src.d, src.c, src.b, src.a -- extra fields are ignored
FROM Source src -- has additional columns f and g
case class Source(a: Int, b: Int, c: Int, d: Int, e: Int, f: Int, g: Int)
                                                       // ↑ extra fields
case class Target(e: Int, d: Int, c: Int, b: Int, a: Int)
val src: Source = ...
val tgt = src.into[Target].transform
       // Target(e = src.e, d = src.d, c = src.c, b = src.b, a = src.a)

Renames

SELECT src.a, src.b, src.c AS z -- column c is renamed to z
FROM Source src -- has attribute c
case class Source(a: Int, b: Int, c: Int)

case class Target(a: Int, b: Int, z: Int)
                               // ↑ renamed field
val src: Source = ...
val tgt = src.into[Target]
            .withFieldRenamed(_.c, _.z) // <-- c AS z
            .transform
       // Target(a = src.a, b = src.b, z = src.c)

Cartesian product

SELECT foo.a, foo.b, foo.c, bar.d, bar.e
FROM Foo foo
CROSS JOIN Bar bar -- Cartesian product
case class Foo(a: Int, b: Int, c: Int)

case class Bar(d: Int, e: Int)

case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val bar: Bar = ...
val baz = foo.into[Baz]
            .withFallback(bar) // <-- CROSS JOIN
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = bar.d, e = bar.e)
SELECT foo.a, foo.b, foo.c, provided.d, provided.e
FROM Foo foo
CROSS JOIN (SELECT 1 as d, 2 as e) provided
case class Foo(a: Int, b: Int, c: Int)

case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldConst(_.d, 1)
            .withFieldConst(_.e, 2)
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = 1, e = 2)
SELECT foo.a, foo.b, foo.c, 1 AS d, 2 AS e -- use constants without a table
FROM Foo foo
case class Foo(a: Int, b: Int, c: Int)
                                    // ↓ extra fields
case class Baz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldConst(_.d, 1)
            .withFieldConst(_.e, 2)
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = 1, e = 2)
SELECT foo.a, foo.b, foo.c, custom_function(foo.a, foo.b) AS d
FROM Foo foo
case class Foo(a: Int, b: Int, c: Int)
                                    // ↓ extra field
case class Baz(a: Int, b: Int, c: Int, d: Int)
val foo: Foo = ...
val baz = foo.into[Baz]
            .withFieldComputed(_.d, foo => custom_function(foo.a, foo.b))
            .transform
       // Baz(a = foo.a, b = foo.b, c = foo.c, d = custom_function(foo.a, foo.b))

These were quite easy.

Let’s try something harder.

Enums

enum UserType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
}
val userType: UserType = ...
userType match {
  case UserType.Admin => ...
  case UserType.Moderator(roles) => ...
  case UserType.User(verified) => ...
}
enum UserType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
}
enum UserDBType {
  case Admin
  case Moderator(roles: List[String])
  case User(verified: Boolean)
  case Suspended // <-- extra case
}
val userType: UserType = ...
userType match {
  case UserType.Admin => UserDBType.Admin
  case UserType.Moderator(roles) => UserDBType.Moderator(roles)
  case UserType.User(verified) => UserDBType.User(verified)
  // nothing is matched into Suspended and that's ok
}
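The same match-by-name rules let a Chimney-like library derive that pattern match for us; a sketch, assuming the import shown earlier:

val userType: UserType = UserType.Moderator(List("mods"))
// every UserType case has a same-named UserDBType counterpart,
// so the whole match above can be generated:
val dbType: UserDBType = userType.into[UserDBType].transform
// the extra UserDBType.Suspended case is simply never produced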
-- we can use unique attribute names to distinguish which case each row came from
SELECT ut.type, null as roles, null as verified FROM UserType ut WHERE ut.type = 'Admin'
UNION
SELECT ut.type, ut.roles, null as verified FROM UserType ut WHERE ut.type = 'Moderator'
UNION
SELECT ut.type, null as roles, ut.verified FROM UserType ut WHERE ut.type = 'User'

Nested data

case class OuterFoo(inner: InnerFoo)
case class InnerFoo(a: Int, b: Int, c: Int)

case class OuterBar(inner: InnerBar)
case class InnerBar(a: Int, b: Int, c: Int)
val foo: OuterFoo = ...

val baz = foo.into[OuterBar].transform // <-- is this still principled?
OuterBar(
  inner = foo.inner.into[InnerBar].transform
)
// or
OuterBar(
  inner = {
    val inner = foo.inner
    InnerBar(a = inner.a, b = inner.b, c = inner.c)
  }
)
// or
OuterBar(
  inner = InnerBar(a = foo.inner.a, b = foo.inner.b, c = foo.inner.c)
)
SELECT foo.`inner.a` AS `inner.a`,
       foo.`inner.b` AS `inner.b`,
       foo.`inner.c` AS `inner.c`
FROM OuterFoo foo

It looks too easy, so let’s complicate it a bit.

case class OuterFoo(inner: InnerFoo)
case class InnerFoo(a: Int, b: Int)

case class OuterBar(inner: InnerBar)
case class InnerBar(c: Int, d: Int)

case class OuterBaz(inner: InnerBaz)             // ↓ extra field
case class InnerBaz(a: Int, b: Int, c: Int, d: Int, e: Int)
val foo: OuterFoo = ...
val bar: OuterBar = ...

val baz = foo.into[OuterBaz]
   .withFallback(bar) // <-- CROSS JOIN / Cartesian product
   .withFieldConst(_.inner.e, 1) // <-- extra field
   .transform
SELECT foo.`inner.a` AS `inner.a`,
       foo.`inner.b` AS `inner.b`,
       bar.`inner.c` AS `inner.c`,
       bar.`inner.d` AS `inner.d`,
       1             AS `inner.e`
FROM OuterFoo foo
CROSS JOIN OuterBar bar

Optional values

case class Source(value: Int)

case class Target(value: Option[Int])
val source: Source = ...
val target = source.into[Target].transform
          // Target(value = Some(source.value))
case class Source2(value: Option[Int])

case class Target2(value: Int)
val source2: Source2 = Source2(None) // <-- cannot be unwrapped!
val target2 = ???

Embeddings

\[E_A = A \cup \{Empty,\ ErrorMsg,\ Exception \}\]
// pseudo-code
class Empty
class ErrorMsg
class Exception

type E[A] = A | Empty | ErrorMsg | Exception

val result: E[Target2] = ...

result match {
  case e: Empty => ...
  case e: ErrorMsg => ...
  case e: Exception => ...
  case success: Target2 => ...
}
val source2: Source2 = Source2(None) // <-- cannot be unwrapped!
val result: partial.Result[Target2] = source2.intoPartial[Target2].transform

result match { // <-- actually, implemented as enum (sealed trait)
  case partial.Result.Value(target2) => ... // <-- no runtime failures
  case partial.Result.Errors(errors) => ... // <-- 1 or more errors
}
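For completeness, a partial.Result does not have to be pattern matched; it can be folded into more familiar types (a sketch based on Chimney's documented conversions):

result.asOption                  // Option[Target2]
result.asEither                  // Either[partial.Result.Errors, Target2]
result.asErrorPathMessageStrings // which fields failed and why, as (path, message) pairs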

Different naming conventions

case class Source(someName: String, anotherName: String) // camel case

case class Target(some_name: String, another_name: String) // snake case

Equivalence relation

  1. symmetric - \(a \equiv b \iff b \equiv a\)

  2. reflexive - \(a \equiv a\)

  3. transitive - \(a \equiv b \land b \equiv c \implies a \equiv c\)

If we treat names written in different cases as equivalent, we can merge them into a single equivalence class.

\[\{someName, some\_name, SomeName, ...\}\]
\[\{anotherName, another\_name, AnotherName, ...\}\]
// Define the equivalence relation via a predicate and use it for name matching
case object DifferentCasesAreEqual extends TransformedNamesComparison {

  def namesMatch(fromName: String, toName: String): Boolean = ...
}
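One possible body for namesMatch (my illustration; the talk elides it): strip separators and compare case-insensitively.

  def namesMatch(fromName: String, toName: String): Boolean = {
    // "someName", "some_name" and "SomeName" all normalize to "somename"
    def normalize(name: String): String =
      name.filterNot(c => c == '_' || c == '-').toLowerCase
    normalize(fromName) == normalize(toName)
  }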
case class Source(someName: String, anotherName: String) // camel case

case class Target(some_name: String, another_name: String) // snake case

val source: Source = ...
val target = source.into[Target]
  .enableCustomFieldNameComparison(DifferentCasesAreEqual)
  .transform
// Target(some_name = source.someName, another_name = source.anotherName)

So far:

  • resolving values for fields can follow some reasonable rules expressed declaratively

  • similarly for enums/coproducts

  • and nested data

  • unavoidable runtime errors can be handled in a type-safe way

  • we can provide missing values in a principled way as well

Where is the catch?

Real World

Fields vs normal vals

case class Source(a: Int, b: String, c: Double)

case class Target(a: Int, b: String, c: Double)

val source: Source = ...
val target = source.into[Target].transform
          // Target(a = source.a, b = source.b, c = source.c)
abstract class Parent {

  val a: Int = 1 // a is declared as val, but in parent!
}
case class Source(d: Int) extends Parent {

  def b: String = "value" // b is declared as method, not field!

  val c: Double = 1.0 // c is declared as val, but not in "fields" list!
}

case class Target(a: Int, b: String)

val source: Source = ...
val target = source.into[Target].transform
          // Target(a = source.a, b = source.b) <-- should these accessors be used?
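One library-level answer (Chimney's, for instance) is not to guess but to make such non-constructor accessors opt-in; a sketch using its documented flags:

val target = source.into[Target]
  .enableInheritedAccessors // allow the val a inherited from Parent
  .enableMethodAccessors    // allow def b, which is a method, not a field
  .transform
// Target(a = source.a, b = source.b)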

Flattening

case class User(meta: UserMetadata, data: UserData)
case class UserMetadata(id: Long, createdAt: Instant)
case class UserData(name: String, age: Int)

case class UserDto(id: Long, name: String, age: Int, createdAt: Instant)

val user: User = ...

val userDto = ??? // <-- should we automatically flatten user?
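The flattening can at least be expressed explicitly with the DSL shown so far; whether it should also happen automatically is exactly the open question. A sketch:

val userDto = user.into[UserDto]
  .withFallback(user.meta) // supplies id and createdAt
  .withFallback(user.data) // supplies name and age
  .transform
// UserDto(user.meta.id, user.data.name, user.data.age, user.meta.createdAt)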

Java Beans

case class User(name: String, age: Int)

class UserBean {
  private var name: String = uninitialized
  def getName(): String = name
  def setName(name: String): Unit = this.name = name

  private var age: Int = uninitialized
  def getAge(): Int = age
  def setAge(age: Int): Unit = this.age = age
}
val user: User = ...
// Requires mapping:
//  - setName(user.name)
//  - setAge(user.age)
val userBean: UserBean = user.into[UserBean].transform
val userBean: UserBean = ...
// Requires mapping:
//  - name = userBean.getName()
//  - age = userBean.getAge()
val user: User = userBean.into[User].transform
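In Chimney, for example, bean getters and setters are not picked up unless explicitly enabled, so both directions need a flag; a sketch using the documented DSL:

val bean: UserBean = user.into[UserBean]
  .enableBeanSetters // allow writing via setName/setAge on a default-constructed bean
  .transform

val userBack: User = bean.into[User]
  .enableBeanGetters // allow reading via getName()/getAge()
  .transform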
trait NamedSet {

  // name of the set this value belongs to
  def setName: String = "User"
}
case class User(name: String, age: Int) extends NamedSet {

  // this value belongs to the User set
  override def setName: String = "User"
}

class UserBean {
  private var name: String = uninitialized
  def getName(): String = name
  def setName(name: String): Unit = this.name = name

  private var age: Int = uninitialized
  def getAge(): Int = age
  def setAge(age: Int): Unit = this.age = age
}

AnyVal and Option

case class Source(id: String, data: String) // Newtype-like ↓
case class SourceOptional(source: Option[Source]) extends AnyVal

case class Target(id: String, data: String)

val source: Source  = ...
source.into[Target].transform
// Target(id = source.id, data = source.data)
source.into[Option[Target]].transform
// Some(Target(id = source.id, data = source.data))
Option(source).into[Option[Target]].transform
// Option(source).map(src => Target(id = src.id, data = src.data))
Option(source).intoPartial[Target].transform.asOption
// Some(Target(id = source.id, data = source.data))

Option.empty[Source].intoPartial[Target].transform.asOption
// None
case class Source(id: String, data: String) // Newtype-like ↓
case class SourceOptional(source: Option[Source]) extends AnyVal

case class Target(id: String, data: String)

val source: Source  = ...
// What does it do?
SourceOptional(Some(source)).intoPartial[Option[Target]].transform.asOption
SourceOptional(None).intoPartial[Option[Target]].transform.asOption
// hint: type is Option[Option[Target]]
// Expected:
Some(Some(Target(id = source.id, data = source.data)))
Some(None)
// Actual:
Some(Some(Target(id = source.id, data = source.data)))
None

Summary

  1. when we have schemas and the source and target are similar, we can define a principled way to map between them

  2. but users might have expectations about the types, which might require creativity in defining the rules

  3. in real-world code there will be cases with no single obvious behavior

  4. in real-world code there will be feature-interaction bugs

Thank you!