Spark

3 posts

Thrill – Big Data Processing with C++

Yesterday, I discovered an experimental Big Data processing framework written in C++ called Thrill. As most of you surely know, the well-known frameworks of this kind are mostly based on JVM, like Apache Spark or Apache Flink. This, of course, has many advantages, like easily accessible interfaces and a more domain-oriented approach, as we don’t have to deal with “Ceremony Code” or any internals that don’t touch our domain logic. However, everything comes at a cost and utilizing a VM is a price to be paid no matter how optimized your code is. It’s no wonder these projects often resort to […]

Apache Spark

Data Science for Losers, Part 5 – Spark DataFrames

Sometimes, the hardest part in writing is completing the very first sentence. I began to write the “Loser’s articles” because I wanted to learn a few bits on Data Science, Machine Learning, Spark, Flink etc., but as the time passed by the whole degenerated into a really chaotic mess. This may be a “creative” chaos but still it’s a way too messy to make any sense to me. I’ve got a few positive comments and also a lot of nice tweets, but quality is not a question of comments or individual twitter-frequency. Do these texts properly describe “Data Science”, or at […]

Writing Monads in Scala with Spark-Notebook

Douglas Crockford once said that people who finally understand Monads immediately lose the capability to explain them to others. Well, the few readers of this chaotic blog are lucky: neither I understand them nor am able to explain them anyway. However, I can say in advance that a Monad in Scala is something that implements two methods: map and flatMap. Haskell coders (luckily, they’re certainly not reading this blog) now would say: No, there’s no flatMap but only bind written as >>=. Yes, I know but anyway, we’ll stick with flatMap. And to make this article somewhat cooler we’ll use a […]