
Building a Versatile Analytics Pipeline on Top of Apache Spark

Updated on November 2, 2020

Grammarly, like most growing companies, strives to make data-driven decisions. That means that we need a reliable way to collect, analyze, and query data about our users. We started out using third-party tools like Mixpanel to handle our analytics needs, but soon our needs surpassed the capabilities of those tools. For example, we wanted to control the pre-aggregation and enrichment of data, to generate reports that were more customized, and to have higher confidence in the accuracy of data. So we developed our own in-house analytics engine and application on top of Apache Spark. Recently, I gave a talk at the Spark Summit sharing some of our learnings along the way. The talk covered:

  • Outputting data to several storages in a single Spark job (a minimal sketch follows this list)
  • Dealing with the Spark memory model by building a custom spillable data structure for data traversal
  • Implementing a custom query language with parser combinators on top of the Spark SQL parser (sketched below)
  • Building a custom query optimizer and analyzer (sketched below)
  • Flexible-schema storage and querying across multi-schema data with schema conflicts (sketched below)
  • Writing custom aggregation functions in Spark SQL (sketched below)
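To give a flavor of the first point, here is a minimal sketch of writing a single computed DataFrame to two different storages within one Spark job. The input path, the enrichment step, and the JDBC connection details are hypothetical placeholders rather than our production setup; the point is to persist the result once so that both writes reuse the same computation.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.storage.StorageLevel

object MultiSinkJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-sink-example").getOrCreate()

    // Hypothetical input and enrichment step.
    val events = spark.read
      .json("s3://example-bucket/raw-events/")
      .filter("eventType IS NOT NULL")

    // Persist once so both writes below reuse the same computed result
    // instead of re-reading and re-processing the input.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    // Sink 1: columnar files for ad-hoc analytical queries.
    events.write
      .mode(SaveMode.Append)
      .partitionBy("eventType")
      .parquet("s3://example-bucket/curated-events/")

    // Sink 2: a relational database for dashboards (connection details are placeholders).
    events.write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
      .option("dbtable", "events")
      .option("user", "analytics")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .save()

    events.unpersist()
    spark.stop()
  }
}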
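The custom query language item refers to parsing our own analytics queries before handing them to Spark. As an illustration only (this is not our actual grammar), here is a toy parser built with Scala's parser combinators (the scala-parser-combinators module) that turns a tiny "count events where …" expression into an AST that could later be translated into a Spark SQL plan.

import scala.util.parsing.combinator.RegexParsers

// Toy AST: enough to represent "count events" with an optional filter.
sealed trait Query
case class Condition(field: String, value: String)
case class CountEvents(filter: Option[Condition]) extends Query

// Parses expressions such as: count events where eventType = 'signup'
object MiniQueryParser extends RegexParsers {
  private def identifier: Parser[String] = """[a-zA-Z_][a-zA-Z0-9_]*""".r
  private def literal: Parser[String]    = "'" ~> """[^']*""".r <~ "'"

  private def condition: Parser[Condition] =
    identifier ~ ("=" ~> literal) ^^ { case field ~ value => Condition(field, value) }

  private def query: Parser[Query] =
    "count" ~> "events" ~> opt("where" ~> condition) ^^ { cond => CountEvents(cond) }

  def parseQuery(input: String): Either[String, Query] =
    parseAll(query, input) match {
      case Success(result, _) => Right(result)
      case Failure(msg, _)    => Left(msg)
      case Error(msg, _)      => Left(msg)
    }
}

For example, MiniQueryParser.parseQuery("count events where eventType = 'signup'") returns Right(CountEvents(Some(Condition("eventType", "signup")))).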
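Custom optimizer rules can be plugged into Catalyst without forking Spark. The rule below is deliberately trivial (it removes filters whose predicate is the literal TRUE, something Catalyst already does on its own) and it touches internal Catalyst classes, so treat it as a version-sensitive sketch of where such rules hook in rather than anything we actually ship.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// A deliberately trivial rule: drop filters whose predicate is the literal TRUE.
object RemoveTrivialFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

object CustomRuleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("custom-rule-example").getOrCreate()

    // Extra rules registered here are appended to the Catalyst optimizer's rule batches.
    spark.experimental.extraOptimizations ++= Seq(RemoveTrivialFilters)

    spark.range(10).filter("true").explain(true)
    spark.stop()
  }
}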
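For flexible-schema storage, Spark's built-in Parquet schema merging already covers the additive case, where files written at different times carry different subsets of fields; genuinely conflicting types are what required custom handling in our pipeline, which the talk goes into. A minimal sketch of the built-in behavior, with a hypothetical path:

import org.apache.spark.sql.SparkSession

object SchemaMergeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-merge-example").getOrCreate()

    // Reconcile additive schema differences across Parquet files into one superset schema.
    val events = spark.read
      .option("mergeSchema", "true")
      .parquet("s3://example-bucket/curated-events/")

    events.printSchema()

    // Fields absent from older files come back as NULL, so queries can
    // reference the union of all schemas.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT eventType, COUNT(*) AS cnt FROM events GROUP BY eventType").show()

    spark.stop()
  }
}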
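Finally, a sketch of a custom aggregation function using Spark SQL's typed Aggregator API. The Event case class and the distinct-user count are illustrative stand-ins, not one of our real metrics (a production pipeline would more likely use a sketch such as HyperLogLog than an exact set).

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input row shape.
case class Event(userId: String, eventType: String)

// Counts distinct users per group by accumulating user IDs into a set.
object DistinctUsers extends Aggregator[Event, Set[String], Long] {
  def zero: Set[String] = Set.empty
  def reduce(buffer: Set[String], event: Event): Set[String] = buffer + event.userId
  def merge(b1: Set[String], b2: Set[String]): Set[String] = b1 ++ b2
  def finish(buffer: Set[String]): Long = buffer.size.toLong
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object CustomAggregationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("custom-agg-example").getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("u1", "signup"),
      Event("u2", "signup"),
      Event("u1", "open_doc")
    ).toDS()

    // Use the aggregator as a typed column on a grouped Dataset.
    events
      .groupByKey(_.eventType)
      .agg(DistinctUsers.toColumn.name("distinct_users"))
      .show()

    spark.stop()
  }
}

On Spark 3.0 and later, the same Aggregator can also be registered for use from SQL via functions.udaf and spark.udf.register.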


Here is the video of the talk:

Check out the slides as well:
