[HN Gopher] Launch HN: Matano (YC W23) - Open-Source Security La...
       Launch HN: Matano (YC W23) - Open-Source Security Lake Platform
       (SIEM) for AWS
       Hi HN! We're Shaeq and Samrose, co-founders of Matano
       (https://matano.dev). Matano is a high-scale, low-cost alternative
       to traditional SIEM (e.g. Splunk, Elastic) built around a vendor-
       agnostic security data lake that deploys to your AWS account.
       Don't worry -- we'll explain all this jargon in a second.  SIEM
       stands for "Security Information and Event Management" and refers
       to log management tools used by security teams to detect threats
       from an organization's security logs (network, host, cloud, SaaS
       audit logs, etc.) and send alerts about them. Security engineers
       write detection rules inside the SIEM as queries to detect
       suspicious activity and create alerts. For example, a security
       engineer could write a detection rule that checks the fields in
       each CloudTrail log and creates an alert whenever an S3 bucket is
       modified with public access, to prevent data exfiltration.
       Traditional SIEM tools (e.g. Splunk, Elastic) used to analyze
       security data are difficult to manage for security teams on the
       cloud. Most don't scale because they are built on top of a NoSQL
       database or search engine like Elasticsearch. And they are
       expensive -- the enterprise SIEM vendors have costly ingest-based
       licenses. Since security data from SaaS and cloud environments can
       exceed hundreds of terabytes, teams are left with unsatisfactory
       options: either not collect some data, leave some data unprocessed,
       pay exorbitant fees to an enterprise vendor, or build their own
       large-scale solution for data storage (aka "data lake").  Companies
       like Apple, HSBC, and Brex do the latter: they build their own
       security data lakes to analyze their security data without breaking
       the bank. "Data lake" is jargon for heterogeneous data that is too
       large to be kept in a standard database and is analyzed directly
       from object storage like S3. A "security data lake" is a repository
       of security logs parsed and normalized into a common structure and
       stored in object storage for cost-effective analysis. Building your
       own data lake is a fine option if you're big enough to justify the
       cost -- but most companies can't afford it.  Then there's the
       vendor lock-in issue. SIEM vendors store data in proprietary
       formats that make it difficult to use outside of their ecosystem.
       Even with "next-gen" products that leverage data lake technology,
       it's nearly impossible to swap out your data analytics stack or
       migrate your security data to another tool because of a tight
       coupling of systems designed to keep you locked in.  Security
       programs also suffer because of poor data quality. Most SIEMs today
       are built as search engines or databases that query
       unstructured/semi-structured logs. This requires you to heavily
       index data upfront which is inefficient, expensive and makes it
       hard to analyze months of data. Writing detection rules requires
       analysts to use vendor-specific DSLs that lack the flexibility to
       model complex attacker behaviors. Without structured and normalized
       data, it is difficult to correlate across data sources and build
       effective rules that don't create many false positive alerts.
       While the cybersecurity industry has been stuck dealing with these
       legacy architectures, the data analytics industry has seen a ton of
       innovation through open-source initiatives such as Apache Iceberg,
       Parquet, and Arrow, delivering massive cost savings and performance
       breakthroughs.  We encountered this problem when building out
       petabyte-scale data platforms at Amazon and Duo Security. We
       realized that most security teams don't have the resources to build
       a security data lake in-house or take advantage of modern analytics
       tools, so they're stuck with legacy SIEM tools that predate the
       cloud.  We quit our jobs at AWS and started Matano to close the gap
       between these two worlds by building an OSS platform that helps
       security teams leverage the modern data stack (e.g. Spark, Athena,
       Snowflake) and efficiently analyze security data from all the
       disparate sources across an organization.  Matano lets you ingest
       petabytes of security and log data from various sources, store and
       query them in an open data lake, and create Python detections as
       code for realtime alerting.  Matano works by normalizing
       unstructured security logs into a structured realtime data lake in
       your AWS account. All data is stored in optimized Parquet files in
       S3 object storage for cost-effective retention and analysis at
       petabyte scale. To prevent vendor lock-in, Matano uses Apache
       Iceberg, a new open table format that lets you bring your own
       analytics stack (Athena, Snowflake, Spark, etc.) and query your
       data from different tools without having to copy any data. By
       normalizing fields according to the Elastic Common Schema (ECS), we
       help you easily search for indicators across your data lake, pivot
       on common fields, and write detection rules that are agnostic to
       vendor formats.  We support native integrations to pull security
       logs from popular SaaS, Cloud, Host, and Network sources and custom
       JSON/CSV/Text log sources. Matano includes a built-in log
       transformation pipeline that lets you easily parse and transform
       logs at ingest time using Vector Remap Language (VRL) without
       needing additional tools (e.g. Logstash, Cribl).  Matano uses a
       detection-as-code approach which lets you use Python to implement
       realtime alerting on your log data, and lets you use standard dev
       practices by managing rules in Git (test, code review, audit).
       Advanced detections that correlate across events and alerts can be
       written using SQL and executed on a scheduled basis.  We built
       Matano to be fully serverless using technologies like Lambda, S3,
       and SQS for elastic horizontal scaling. We use Rust and Apache
       Arrow for high performance. Matano works well with your existing
       data stack, allowing you to plug in tools like Tableau, Grafana,
       Metabase, or Quicksight for visualization and use query engines
       like Snowflake, Athena, or Trino for analysis.  Matano is free and
       open source software licensed under the Apache-2.0 license. Our use
       of open table and common schema standards gives you full ownership
       of your security data in a vendor neutral format. We plan on
       monetizing by offering a cloud product that includes enterprise and
       collaborative features to be able to use Matano as a complete
       replacement to SIEM.  If you're interested to learn more, check out
       our docs (https://matano.dev/docs), GitHub repository
       (https://github.com/matanolabs/matano), or visit our website
       (https://matano.dev).  We'd love to hear about your experiences
       with SIEM, security data tooling, and anything you'd like to share!
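The detection rule described in the post (alert when a CloudTrail log shows an S3 bucket being opened to public access) could be sketched roughly as below. This is an illustration only, not Matano's actual detection API: the function name `detect_public_bucket_change`, the list of event names, and the string markers used to spot a public grant are all assumptions, and a real rule would also need to inspect bucket policies.

```python
import json

# Assumed CloudTrail event names that can change a bucket's public exposure.
RISKY_EVENTS = {"PutBucketAcl", "DeleteBucketPublicAccessBlock"}

# Simplified markers that suggest a grant to everyone (an assumption, not exhaustive).
PUBLIC_MARKERS = (
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "public-read",
)


def detect_public_bucket_change(record: dict) -> bool:
    """Return True when an S3 bucket change looks like it opens public access."""
    if record.get("eventSource") != "s3.amazonaws.com":
        return False
    name = record.get("eventName")
    if name not in RISKY_EVENTS:
        return False
    # Removing the public access block is treated as suspicious on its own.
    if name == "DeleteBucketPublicAccessBlock":
        return True
    # For ACL changes, look for a public grant anywhere in the request parameters.
    params = json.dumps(record.get("requestParameters") or {})
    return any(marker in params for marker in PUBLIC_MARKERS)
```

A rule like this runs once per log record; anything that needs state across records (counts, sequences) is a different kind of detection.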
       Author : wizwit999
       Score  : 101 points
       Date   : 2023-01-24 16:17 UTC (6 hours ago)
       | mdaniel wrote:
       | I loaded the GitHub link, bracing myself for yet another AGPL
       | license, but no, it's Apache 2! So I wanted to say thank you for
       | that and I hope to take a deeper look when I'm back at my desk
       | because trying to keep Splunk alive and happy is a monster pain
       | point. There are so many data sources we'd love to throw at it
       | but we don't have the emotional energy to put up with Splunk
       | crying about it
         | shaeqahmed wrote:
         | Thank you! We definitely believe in open source and don't need
         | AGPL. Sending you love as you deal with that Splunk instance.
         | P.S. feel free to open some issues for any log sources you'd
         | like to see supported in Matano
       | slt2021 wrote:
       | Question to Matano authors - won't your solution simply enrich
       | AWS by blowing up my cloud bill ?
       | Did you estimate how many times lambda will get invoked and what
       | will be AWS bill for 1 million events ingested? I am curious to
       | learn the price to pay for serverless SIEM
         | shaeqahmed wrote:
         | The code is written in high performance multi-threaded Rust and
         | uses the [1] Arrow compute framework. We also batch events and
         | target about 32MB of event data per Lambda invocation. As a
         | result it can process tens of thousands of events per second
         | per thread, depending on the number of transformations.
         | That said, we are working on performance estimates and a
         | benchmark on some real world data for Matano to help users like
         | you better understand the cost factors. Stay tuned.
         | [1] https://github.com/jorgecarleitao/arrow2
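To make the batching point concrete, here is a minimal sketch of size-based batching and the invocation count it implies. It is not Matano's code (which is Rust); the function names and the 1 KiB average event size used in the note below are assumptions chosen only to show the arithmetic.

```python
import math
from typing import Iterable, Iterator, List

TARGET_BATCH_BYTES = 32 * 1024 * 1024  # about 32 MB per invocation, per the comment above


def batch_by_size(events: Iterable[bytes], limit: int = TARGET_BATCH_BYTES) -> Iterator[List[bytes]]:
    """Group raw events into batches of at most `limit` bytes.

    A single event larger than `limit` still gets a batch of its own.
    """
    batch: List[bytes] = []
    size = 0
    for event in events:
        # Start a new batch if adding this event would exceed the limit.
        if batch and size + len(event) > limit:
            yield batch
            batch, size = [], 0
        batch.append(event)
        size += len(event)
    if batch:
        yield batch


def estimated_invocations(event_count: int, avg_event_bytes: int, limit: int = TARGET_BATCH_BYTES) -> int:
    """Rough lower bound on invocations if every batch is filled to the target."""
    return math.ceil(event_count * avg_event_bytes / limit)
```

Under the assumed 1 KiB average, `estimated_invocations(1_000_000, 1024)` gives 31, so a million events would be on the order of tens of invocations rather than a million. Real numbers depend on event size, arrival rate, and how full each batch gets.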
       | bovermyer wrote:
       | Oh, this very much has my attention. I'll be checking this out in
       | depth.
       | wdb wrote:
       | Anyone aware of a similar solution for Google Cloud / GCP?
         | shaeqahmed wrote:
         | We are working on a solution for GCP and Azure :) GCP recently
         | announced Iceberg support with BigLake and support for
         | federation across multi-cloud lakes so it would be a perfect
         | use case.
         | If you are interested in using Matano for GCP, feel free to
         | reach out and join our Discord community! We are FOSS so would
         | love to collaborate on a solution.
           | parkerhiggins wrote:
           | Definitely looking forward to GCP.
       | protoduction wrote:
       | Hi Shaeq and Samrose - congrats on the launch! Matano looks
       | great.
       | Out of curiosity, at some point I believe you were working on a
       | predecessor called AppTrail which tackled (customer-facing) audit
       | logs, it was something I was interested in at the time (and still
       | am! I would've loved to use that).
       | Would you perhaps be willing to share your learnings from that
       | product, and (I assume) why it evolved into Matano?
         | shaeqahmed wrote:
         | Thank you! Yes with AppTrail we wanted to solve the pain points
         | around SaaS audit logs but since it was a product that needed
         | to be sold and integrated into B2B startups rather than the
         | enterprises that felt the pain points and needed audit logs in
         | their SIEM, we couldn't find a big enough market to sell it.
         | We realized that the big problem was that most SIEMs out there
         | today did a poor job with pulling and handling the data from
         | the multitude of SaaS and Cloud log sources that orgs have
         | today, and decided to build Matano as a cloud-native SIEM
         | alternative :)
       | waihtis wrote:
       | I'm a vendor in the cyberspace so not a potential customer (feel
       | free not to waste time answering) but am just intellectually
       | curious who you're targeting this at. High-skill tech companies
       | who are just building up a security program? I don't see most
       | security teams building their own SIEM'ish solution just because
       | they really don't have the chops or resource to do it. OTOH, it
       | would be a big rip-out operation for F100 companies to change to
       | this from Splunk et al.
         | shaeqahmed wrote:
         | Many enterprises using Splunk are already being forced to
         | purchase products like Cribl to route some of their data to a
         | data lake because writing it all to Splunk is just way too
         | expensive at that scale 1-100TB+/day (7 figures $).
         | But a data lake shouldn't just be a dump of data right? Matano
         | OSS helps organizations build high value data lakes in S3 and
         | reduce their dependency on SIEM by centralizing high throughput
         | data in object storage using Matano to power investigations. To
         | give you an example, one company is using Matano to collect,
         | normalize, and store VPC Flow logs from hundreds of AWS
         | accounts which was too expensive with traditional SIEM.
         | Matano is also completely serverless and automates the
         | maintenance of all resources/tables using IaC so it's perfect
         | for smaller security teams on the cloud dealing with a large
         | amount of data and wanting to use a modern data stack to
         | analyze it.
           | waihtis wrote:
           | nice thanks, makes a lot of sense
         | lmeyerov wrote:
         | (not them, but in this space with major enterprises and gov
         | agencies deploying Graphistry)
         | We are pretty active here with security cloud/on-prem data
         | lakes teams as a way to augment their Splunk with something
         | more affordable & responsive for bigger datasets. Imagine
         | stuffing netflow or winlogs somewhere at TB/PB scale and not
         | selling your first born child. A replacement/fresh story may
         | happen at a young/midage tech co, and a bunch of startups
         | pitching that. But for most co's, we see augmentation and still
         | needing to support on-prem detection & response flows.
         | It's pretty commodity now to dump into say databricks, and we
         | work with teams on our intelligence tier with GPU visual
         | analytics, GPU AI, GPU graph correlation, etc. to make that
         | usable. Most use us to make sense of regular alert data in
         | Splunk/neo4j/etc. However, it's pretty exciting when we do
         | something like looking at vpc flow logs from a cloud-native
         | system like databricks and can thus look at more session
         | context and run fusion AI jobs like generating correlation IDs
         | for pivoting + visualizing.
         | Serverless is def interesting but I've only seen for light
         | orchestration. Everyone big has on-prem footprint which is an
         | extra bit of fun for the orchestration vs investigation side.
           | waihtis wrote:
           | Thanks, this is interesting. I work a bit closer to the
           | "source" as in doing detection things on on-prem & cloud
           | side, so not too well-versed on the data processing and
           | management side.
       | molsongolden wrote:
       | Excited to give this a try and follow your progress!
       | In case anybody else is wondering how Matano compares to Panther
       | (my first thought reading this launch post) there's a comparison
       | on the Matano website[0].
       | Quick note to the Matano team, the "Elastic Common Schema (ECS)"
       | link in the readme[1] seems to be broken.
       | [0] https://www.matano.dev/alternative-to/panther
       | [1] https://github.com/matanolabs/matano#-log-transformation--
       | da...
         | wizwit999 wrote:
         | Thank you, fixed the link!
       | simonebrunozzi wrote:
       | Shaeq and Samrose: for us investors here, where are you in terms
       | of fundraising? I'm an ex AWS (google me, you'll have a few
       | laughs!), turned VC in the past few years. $HN_username at gmail
       | if you want to reach out and chat!
       | Edit: here's me with Andy, from a millennium ago [0].
       | [0]: https://www.youtube.com/watch?v=bWL0_Xdntzo&t=2907s
       | debarshri wrote:
       | I have been exploring this realm of SIEM, XDR, NDR etc. Sure, all
       | proprietary SIEMs are expensive. But what is not clear is how you
       | are going to price it. Security teams have dedicated budget. If
       | you are coming in cheaper than them, then you are destroying your
       | TAM because I know customers would not mind paying those license
       | fees. OSS GTM might work but might work against your TAM.
         | shaeqahmed wrote:
         | We think building a more efficient solution using data lakes is
         | a win-win because it unlocks additional use cases for customers
         | and allows them to analyze larger datasets within the same
         | budget.
         | Solutions that offer an order of magnitude better performance
         | than what is available today are critical for the industry
         | because the amount of data teams are dealing with is growing
         | much faster than their budgets!
         | 0x4e53 wrote:
         | At least from my time at SpaceX - this is untrue.
         | SIEM costs were rapidly ballooning, and we were being charged
         | by RAM. RAM?? Of all things!!
         | After our SIEM costs for ELK ramped up to where Splunk was - we
         | just bought Splunk instead. I imagine there are many security
         | teams out there that would entertain a cheaper alternative that
         | isn't priced by RAM.
           | slt2021 wrote:
           | the reason for that is near real-time detection of threats
           | requires aggregation of terabytes of data according to rules
           | (continuous GROUP BY on thousands of columns on a sliding
           | window) - and these aggregates by design have to be stored in
           | RAM.
           | Otherwise these detections stop being near-realtime and
           | become offline detection instead, just like any other sql
           | server.
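The in-memory sliding-window aggregation described above can be sketched as a per-key counter. This is a toy single-process illustration, not how any particular SIEM implements it; the class name is hypothetical and it assumes events arrive in non-decreasing timestamp order.

```python
from collections import Counter, deque
from typing import Deque, Hashable, Tuple


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds`, all held in memory."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, Hashable]] = deque()
        self._counts: Counter = Counter()

    def add(self, timestamp: float, key: Hashable) -> int:
        """Record one event and return the key's count inside the window."""
        self._evict(timestamp)
        self._events.append((timestamp, key))
        self._counts[key] += 1
        return self._counts[key]

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, old_key = self._events.popleft()
            self._counts[old_key] -= 1
            if self._counts[old_key] == 0:
                del self._counts[old_key]
```

Every event inside the window stays resident until it expires, which is why memory grows with event rate, window length, and the number of distinct keys being grouped on.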
             | 0x4e53 wrote:
             | To be clear - we were hosting on-premise, and being charged
             | for our own RAM. Servers we had to buy, and then pay for
             | the privilege of using with ELK.
       | alecbell wrote:
       | This is awesome. Nice work open-sourcing it! I used Splunk at
       | Expedia and it was super expensive and slow. While I wasn't using
       | it for security purposes, it could take 15-30 min for us to
       | detect error logs, and I can imagine that's not okay for security
       | purposes. Good luck guys!
         | sullivanmatt wrote:
         | Wowsa. Somebody didn't do their job right if it took anywhere
         | near that amount of time to get logs back. Sorry it was so
         | painful.
       | napolux wrote:
       | Super random question... I wonder if the name is related to Frank
       | Matano, the italian youtuber/comedian.
         | simonebrunozzi wrote:
         | Funny, I had the same in mind! (he's really funny!)
         | shaeqahmed wrote:
         | Nope, Matano is named after one of the deepest lakes in the
         | world - Lake Matano of Indonesia ;)
           | napolux wrote:
           | TIL
           | Thanks!
       | brunes wrote:
       | How do you position this against AWS's own Security Lake
       | announced at re:Invent in November
       | (https://aws.amazon.com/security-lake/) ?
       | Your architecture diagram looks like a carbon copy of theirs.
         | shaeqahmed wrote:
         | We launched before Amazon Security Lake :)
         | Amazon Security Lake's main value prop is that it is a single
         | place where AWS / partner security logs can be stored and sent
         | to downstream vendors. As such, Amazon only writes OCSF
         | normalized logs to the parquet-based data lake for its own
         | data in a fully managed way (VPC flow logs, Cloudtrail, etc.)
         | and leaves it to the customers to handle the rest.
         | For partner sources, the integration approach has been to tell
         | customers to set up infrastructure themselves to accomplish
         | OCSF normalization, parquet conversion, etc. For example, here
         | is okta's guide using Firehose and Lambda,
         | https://www.okta.com/blog/2022/11/an-automated-approach-to-c...
         | The Amazon Security Lake offering is built on top of Lake
         | Formation, which itself is an abstraction around services such
         | as Glue, Athena, and S3. Security Lake is built using the
         | legacy Hive style approach and does not use Athena Iceberg.
         | There is a per-data cost associated with the service, in
         | addition to the costs incurred by other services for your data
         | lake. Looks like the primary use case of the service is being
         | able to store first-party AWS logs across all your accounts in
         | a data lake and being able to route them to analytical partners
         | (SIEM) without much effort. It does not seem very useful for an
         | organization that is looking to build its own security data
         | lake with more advanced features, as you will still have to do
         | all the work yourself.
         | Matano has a broader goal to help orgs in every step of
         | transforming, normalizing, enriching and storing _all_ of their
         | security logs into a structured data lake, as well as giving
         | users a platform to build detection-as-code using Python & SQL
         | for correlation on top of it (SIEM augmentation/alternative).
         | All processing and data lake management (conversion to parquet,
         | data compaction, table management) is fully automated by
         | Matano, and users do not need to write any custom code to
         | onboard data sources.
         | Matano can ingest data from Cloud, Endpoint, SaaS, and
         | practically any custom source using the in-built Log
         | transformation pipeline (think serverless Logstash). We are
         | built around the Elastic Common Schema, and use Apache Iceberg
         | (ACID support, recommended for Athena V2+). Matano's data lake
         | is also vendor neutral and can be queried by any Iceberg-
         | compatible engine without having to copy any data around
         | (Snowflake, Spark, etc.).
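As an illustration of what normalizing to a common schema means, the sketch below maps a few CloudTrail fields onto ECS-style dotted names. Matano does this with VRL at ingest time; this Python version is only an analogy, and the specific field mapping and helper names are assumptions made for the example, not Matano's actual mapping.

```python
from typing import Any, Dict

# Assumed mapping from CloudTrail field paths to ECS-style dotted field names.
FIELD_MAP = {
    "eventTime": "@timestamp",
    "eventName": "event.action",
    "awsRegion": "cloud.region",
    "sourceIPAddress": "source.ip",
    "userIdentity.userName": "user.name",
}


def get_path(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current: Any = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_path(record: Dict[str, Any], path: str, value: Any) -> None:
    """Set a value at a dotted path, creating nested dicts as needed."""
    *parents, leaf = path.split(".")
    for part in parents:
        record = record.setdefault(part, {})
    record[leaf] = value


def normalize(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the mapped fields of a raw log into a new ECS-shaped record."""
    out: Dict[str, Any] = {}
    for src, dst in FIELD_MAP.items():
        value = get_path(raw, src)
        if value is not None:
            set_path(out, dst, value)
    return out
```

Once every source is mapped this way, a query or rule on `source.ip` works the same whether the record came from CloudTrail, a SaaS audit log, or a firewall.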
       | sullivanmatt wrote:
       | This issue exists to the right of your solution and is (for now)
       | out of scope, but the biggest issue I have with security data
       | lakes is the need to (easily) get both row-based data and
       | visualizations. Back when I had access to a well-built and cared
       | for Splunk environment, I would constantly run queries, build
       | visualizations, go back to the results index, tweak the query, go
       | back to viz, etc. This feedback loop is important and allows for
       | fast iteration, especially if you are conducting a high-stakes
       | investigation and need answers rapidly. I should be able to look
       | at my available fields and tweak the viz accordingly in under a
       | few seconds; preferably in one mouse click.
       | Now I live on an ELK stack and I experience nothing but full-time
       | agony as I switch between Kibana and Kibana Lens constantly. It's
       | clear they are two completely separate "products" built for
       | different use-cases. The experience reminds you constantly that
       | they were not purpose-built for how I use them, unlike Splunk.
       | Increasingly we are moving towards the reality of a security data
       | lake, and all I can think is that I'm about to lose what little
       | power I had left as I have to move to something like Mode,
       | Sisense, or Tableau which again, were not purpose-built for these
       | use-cases and even further separate the query/data discovery and
       | visualization layers.
       | I hate how crufty and slow Splunk has gotten as an organization,
       | and they use their accomplishments from 15 years ago to justify
       | the exorbitant price they charge. I really hope the OSS/next-gen
       | SaaS options can fill this need and security data lake becomes a
       | reality. But for that to happen, more focus is needed on the user
       | experience as well.
       | Regardless, very cool stuff and could definitely fill a need for
       | organizations that are just starting to dip toes into security
       | data lakes. I wish you success!
         | shaeqahmed wrote:
         | I completely agree with you and the need for a fully integrated
         | solution with great visualizations without hosting additional
         | tools that aren't purpose built! Unfortunately there are very
         | few SIEMs that get this right today..
         | Here's how we are thinking of it. We think it's important for a
         | successful security program to first have high quality data and
         | this is why we want to help every organization build structured
         | security data lakes to power their analysis using our open
         | source project. The Matano security lake can sit alongside
         | their SIEM and be incrementally adopted for data sources that
         | wouldn't be feasible to analyze otherwise.
         | Our larger goal as a company though is to build a complete
         | platform that allows a security data lake to fully replace
         | traditional SIEM -- including a UI and collaborative features
         | that give you that great feedback loop for fast iteration in
         | detection engineering and threat hunting as you mentioned. Stay
         | tuned, I think you will be excited by what we are building!
           | sullivanmatt wrote:
           | For sure. Pull a dbt and get everybody hooked on your tool,
           | then slap a SaaS platform ecosystem to the farthest right and
           | watch the revenue flow.
             | mox1 wrote:
             | Splunk is HEAVILY pushing their SaaS offering at the
             | moment. They are the most obnoxious vendor we currently
             | deal with.
             | We are fine on prem, pay big $$ license fees, but not
             | enough. They want that sweet SaaS revenue.
             | I would be wary of pushing this, being a non-SaaS platform
             | could be an advantage here.
               | feanaro wrote:
               | Indeed, no more SaaS. I've had enough of this cloud
               | nonsense already.
         | wrldos wrote:
         | The biggest issue I have with data lakes is they _always
         | without fail_ turn into a data cesspool. The more you add the
         | less ROI you get out. And yes using Splunk as an example it
         | becomes an organisational cost problem. I have spent way too
         | many hours arguing with them over billing.
         | The only viable solution is to design metrics into your platform
         | properly from the ground up rather than trying to suck them out
         | of a noisy datasource for megabucks.
       | badrabbit wrote:
       | What distinguishes Matano's existing or planned products from
       | Google Chronicle? Would you have any limits on data ingestion or
       | retention?
       | Also, python detections sounds horrible! I love python but it
       | sounds like you haven't considered the challenges of detection
       | engineering. This is one of my main "expertise" if you will. You
       | should think more along the lines of flexible SQL than Python.
       | People who write detection rules for the most part don't know
       | python and even if they do it would be a nightmare to use for
       | many reasons.
       | I hope someone from your team reads this comment: DO NOT try to
       | invent your own query language but if you do, DON'T start from
       | scratch. Your product could be the best, but people who like the
       | fabulous Splunk need to also like it. And for a security data
       | lake, you must support Sigma rule conversion into your query/rule
       | format. Python is a general purpose language, there are very good
       | reasons why no one else, from Splunk, Elastic, and Graylog to
       | Google and Microsoft, uses Python. Don't learn this hard lesson with
       | your own money. Querying it needs to be very simple and most
       | importantly you need to support regex with capture groups and the
       | equivalent of "|stats" command from splunk if you want to quickly
       | capture market share. I have used and evaluated many of these
       | tools and have written a lot of detection content.
       | Your users are not coders, DB admins or exploit developers. They
       | are really smart people whose focus is understanding threat actors
       | and responding to incidents -- not coding or anything
       | sophisticated. FAANG background founders/devs have a hard time
       | grasping this reality.
         | shaeqahmed wrote:
         | Some big differences:
         | - Matano has realtime Python + SQL detections as code with
         | advanced correlation support. Chronicle uses inflexible YARA-
         | like detection rules iirc
         | - Matano supports Sigma detections by automatically transpiling
         | them to the Python detection format
         | - Matano has an OSS Vendor Agnostic Security Data Lake and can
         | work with multiple clouds / lets you bring your own query
         | engine (Snowflake, Spark, Athena, BigQuery Omni). Chronicle is
         | a proprietary SIEM that uses BigQuery under the hood and cannot
         | be used with other tooling.
         | There are no limits on data retention or ingestion with Matano,
         | it's your S3 bucket and the compute scales horizontally.
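For readers unfamiliar with Sigma, the core of a rule is a "selection": a map of field names to expected values, where all fields must match and a list of values means "any of". The sketch below evaluates that tiny subset in Python. It is not Matano's transpiler and ignores Sigma modifiers, conditions, and wildcards; the function names are hypothetical.

```python
from typing import Any, Dict


def _value_matches(actual: Any, expected: Any) -> bool:
    # Sigma compares strings case-insensitively by default.
    if isinstance(actual, str) and isinstance(expected, str):
        return actual.lower() == expected.lower()
    return actual == expected


def matches_selection(event: Dict[str, Any], selection: Dict[str, Any]) -> bool:
    """True if every field in `selection` matches; a list of values means "any of"."""
    for field, expected in selection.items():
        candidates = expected if isinstance(expected, list) else [expected]
        actual = event.get(field)
        if not any(_value_matches(actual, c) for c in candidates):
            return False
    return True
```

A transpiler would generate code with this behavior from the rule's YAML rather than interpret the selection at runtime, but the matching semantics are the same idea.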
       (page generated 2023-01-24 23:00 UTC)