Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

Build A Data Lake For Your Security Logs With Scanner

Summary

Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Scanner is and the story behind it?
- What were the shortcomings of other tools that are available in the ecosystem?
What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
- What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
Can you describe the architecture of the Scanner platform?
- What were the motivating constraints that led you to your current implementation?
- How have the design and goals of the product changed since you first started working on it?
Given the security-oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
What are the personas of the end-users for Scanner?
- How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
For teams who are working with Scanner, can you describe how it fits into their workflow?
What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
When is Scanner the wrong choice?
What do you have planned for the future of Scanner?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

Scanner
cURL
Rust
Splunk
S3
AWS Athena
Loki
Snowflake
- Podcast Episode
Presto
[Trino](https://trino.io/)
AWS CloudTrail
GitHub Audit Logs
Okta
Cribl
Vector.dev
Tines
Torq
Jira
Linear
ECS Fargate
SQS
Monoid
Group Theory
Avro
Parquet
OCSF
VPC Flow Logs

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA