If you want to turn a Spark DataFrame's schema into a string, two string representations are available: JSON and DDL. DDL stands for Data Definition Language and provides a very concise way to represent a Spark schema. But how do we represent a Spark schema in DDL?
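To make the difference between the two representations concrete, here is a minimal hand-rolled sketch in plain Python. It is not Spark's own conversion: the function `json_schema_to_ddl` and the type mapping are illustrative assumptions, covering only a few flat column types, and Spark's real DDL output may differ in quoting and spacing.

```python
import json

# Assumed mapping from type names in Spark's JSON schema to DDL type names.
# Only a small subset of simple types is covered here.
JSON_TO_DDL = {
    "string": "STRING",
    "integer": "INT",
    "long": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}


def json_schema_to_ddl(schema_json: str) -> str:
    """Convert a flat JSON schema string into a DDL-style string (illustrative only)."""
    schema = json.loads(schema_json)
    columns = []
    for field in schema["fields"]:
        # Nested types (struct, array, map) and unmapped types are not handled
        # and will raise an exception here.
        ddl_type = JSON_TO_DDL[field["type"]]
        not_null = "" if field["nullable"] else " NOT NULL"
        columns.append(f"{field['name']} {ddl_type}{not_null}")
    return ", ".join(columns)


# A two-column schema written in the JSON representation.
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
})

print(json_schema_to_ddl(schema_json))  # id INT NOT NULL, name STRING
```

The JSON form spells out every attribute of every field, while the DDL form fits the same two columns on one short line.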
You have to create a Docker image containing an artifact. However, building this artifact requires tools that you don't need in the final image. How can you get the smallest possible Docker image, without shipping tools that are only used to build the artifact? The solution is Docker multi-stage builds.
A small code snippet to recursively list all CSV files in a directory from a Databricks notebook in Python.
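One possible shape for such a snippet is sketched below. The function name `list_csv_files` and its `ls` parameter are assumptions made for illustration: passing the listing function as an argument keeps the sketch runnable outside Databricks. It assumes each listed entry has `name` and `path` attributes and that directory names end with `/`, as the entries returned by `dbutils.fs.ls` do.

```python
def list_csv_files(path, ls):
    """Return the paths of all .csv files under `path`, including subdirectories.

    `ls` is any callable that takes a path and returns entries exposing
    `name` and `path` attributes (for example `dbutils.fs.ls` in Databricks).
    """
    csv_files = []
    for entry in ls(path):
        if entry.name.endswith("/"):
            # Directory entry: recurse into it.
            csv_files.extend(list_csv_files(entry.path, ls))
        elif entry.name.lower().endswith(".csv"):
            # File entry with a .csv extension (case-insensitive).
            csv_files.append(entry.path)
    return csv_files
```

In a Databricks notebook this would be called as `list_csv_files("dbfs:/some/directory", dbutils.fs.ls)`, where the directory path is a placeholder.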
With the Testcontainers library, you can use a Docker container to provide services such as a database for your tests. With the Flyway library, you can track the schema changes of your database and ensure that those changes are applied to all its instances. How can you initialize the test database provided by Testcontainers with the schema described in Flyway? In this post, we will see how to initialize a PostgreSQL database in a Docker container with Flyway scripts.
A simple configuration of a new Python IntelliJ IDEA project with working PySpark. I was inspired by the "Pyspark on IntelliJ" blog post by Gaurav M Shah; I just removed all the parts about deep learning libraries. I assume that you have a working IntelliJ IDEA installation with the Python plugin, and Python 3 installed on your machine. We will create a Python project in IntelliJ IDEA, change its Python SDK to a virtualenv-based Python SDK, add the PySpark dependency to this virtualenv, install PySpark in it, and finally test it with a small PySpark hello world.