Wednesday, June 7 • 6:00pm - 8:00pm
Integrated Dataflow Processing with Spark and StreamSets

This event is run as the SF Hadoop meetup:

https://www.meetup.com/hadoopsf/events/239675820/

Join us for a meetup with StreamSets to discuss the latest Spark elements.

Agenda: 

6:00-6:30pm Food, drinks and networking

6:30-7:30pm Tech talk

7:30-8:00pm Networking 

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources such as relational databases and log files can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is Apache 2.0-licensed open source software that allows data scientists and data engineers to build robust big data ingest pipelines from pre-built and custom processing stages via a browser-based UI.

In this session, Hari will explain how SDC integrates with Apache Spark, and how developers can create their own custom, reusable processing elements using Spark’s programming model and existing libraries such as GraphX or MLlib. You'll learn how Spark can run SDC pipelines in a wide variety of environments, from standalone systems such as a developer's laptop to on-premises and in-cloud clusters, allowing developers, data scientists and data engineers to process data at unprecedented scale.


Wednesday June 7, 2017 6:00pm - 8:00pm PDT
General Assembly 225 Bush St. (5th floor), San Francisco