[HN Gopher] Launch HN: API Tracker (YC W20) - Track and manage t...
       Launch HN: API Tracker (YC W20) - Track and manage the APIs you use
       Hey HN!  We're Cameron, Trung and Matt from API Tracker
       (https://www.apitracker.com). We make tools to help with using
       third-party APIs in production.
       When software teams integrate with APIs they often run into
       outages, network issues, interface changes or even bugs that cause
       unexpected behavior in the rest of their system. These problems are
       hard to predict and prepare for, so most teams don't deal with them
       until there's an outage and they have to do an emergency build to
       add logging and get to a root cause.
       This is what happened to us. Trung and I are both software
       engineers and we spent a lot of time and energy trying to make our
       API integrations robust and reliable in production. We found
       ourselves instrumenting all our API calls so we could know how many
       calls we were making, how long they were taking and if they were
       failing. We set up alerts for errors and latency increases and
       integrated with PagerDuty. We wrote retry logic with exponential
       backoff. We wrote failover from one API provider to another. At the
       end of it all we built a lot of tooling that required maintenance
       and wasn't even applied uniformly across all of our integrations.
       After building all this infrastructure we realized that many other
       teams are reinventing the same wheel.
       To solve this problem we built an API proxy that takes requests and
       relays them to the API provider. By proxying this traffic we are
       able to instrument each call to measure latency, record status
       codes, headers and bodies, and add reliability features like
       automatic retry with exponential backoff. From there we can monitor
       and alert on issues and provide a searchable call log for debugging
       and auditability.
       We knew that because we were asking teams to run their
       mission-critical API calls through us, we had to build a highly
       available and scalable proxy architecture. We've done this by
       designing a proxy that can be distributed across multiple regions
       and clouds. We are currently running out of AWS. Global Accelerator
       allows us to use their private internet backbone to quickly get
       traffic to our proxies which run behind AWS Network Load Balancers.
       While this can help us ensure resilience against infrastructure
       outages, we also need to protect against self-inflicted wounds like
       bugs and bad deployments. Upon release we bring up a new set of
       proxy instances, deploy the code, and run our full test suite to
       make sure that each instance is able to proxy requests correctly.
       Once all instances are healthy they begin to go into the load
       balancer.
       For companies with more stringent needs we support on-premise
       installations as well as a client-side SDK that can do
       instrumentation without the proxy.
       Today we offer the service as a subscription. We hope to make it
       easy for teams to get visibility and control across all their
       integrations without having to build it themselves. This includes:
       - Detailed logging on all of their third-party API calls
       - Monitoring and alerting for increased latency and error rates
       - Reliability features like automatic retry, circuit breaker and
         request queueing
       - Rate limit and quota monitoring
       We would love to hear from the community how you are managing your
       API integrations. Our story is a result of our experiences and how
       we dealt with them, but we know the HN community has seen it all.
       We would love to hear from you about problems you've had and how
       you dealt with them. Please leave a comment or send us an email to
       founders@apitracker.com. Looking forward to the discussion!
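       For readers who have not built this before, the retry-with-backoff
       and per-call instrumentation described above can be sketched roughly
       as follows. This is a minimal illustration, not API Tracker's
       implementation: InstrumentedCaller, CallRecord and the send callable
       are hypothetical names, and the choice to retry only on 429 and 5xx
       is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CallRecord:
    """One entry in the call log: which attempt, what came back, how long it took."""
    attempt: int
    status: int
    latency_s: float


@dataclass
class InstrumentedCaller:
    # send is a hypothetical stand-in for the real HTTP call; it returns a status code.
    send: Callable[[], int]
    max_attempts: int = 4
    base_delay_s: float = 0.1
    sleep: Callable[[float], None] = time.sleep  # injectable so tests need not wait
    log: List[CallRecord] = field(default_factory=list)

    def backoff(self, attempt: int) -> float:
        # Exponential backoff: base, 2*base, 4*base, ...
        return self.base_delay_s * (2 ** (attempt - 1))

    def call(self) -> int:
        status = 0
        for attempt in range(1, self.max_attempts + 1):
            start = time.perf_counter()
            status = self.send()
            self.log.append(CallRecord(attempt, status, time.perf_counter() - start))
            # Assumption: only 429 and 5xx are worth retrying; anything else is final.
            if status < 500 and status != 429:
                return status
            if attempt < self.max_attempts:
                self.sleep(self.backoff(attempt))
        return status
```

       A production version would usually add random jitter to the delay
       and cap the total time spent retrying.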
       Author : cameroncooper
       Score  : 69 points
       Date   : 2020-02-18 18:00 UTC (5 hours ago)
       | openthc wrote:
       | We've had to build similar tools -- but one step further to make
       | three different upstream services behave in a common way. We also
       | added pre&post flight error checking for cases where the backend
       | wouldn't behave nice.
       | Any plans to "commonize" some different-backends like Twilio /
       | Plivo, or SendGrid, Mandrill, etc, etc?
       | Very nice work!
         | cameroncooper wrote:
         | Thanks for sharing your experience, we have heard similar
         | things from other companies. We do have plans to create common
         | interfaces for different services like SMS/email as you have
         | suggested. This will allow us to seamlessly fail over between
         | providers to maintain uptime and performance without any action
         | on the client's part.
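         | As a rough illustration of the idea (not the product's actual
         | interface), a common interface with ordered failover could look
         | like the sketch below. The Sender signature and
         | send_with_failover are hypothetical; real adapters would wrap
         | each vendor's own SDK.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical common interface: an adapter takes (to, body) and returns True on success.
Sender = Callable[[str, str], bool]


def send_with_failover(providers: List[Tuple[str, Sender]], to: str, body: str) -> str:
    """Try each provider in order and return the name of the one that succeeded."""
    errors: Dict[str, str] = {}
    for name, send in providers:
        try:
            if send(to, body):
                return name
            errors[name] = "rejected"
        except Exception as exc:  # any adapter exception counts as a provider failure
            errors[name] = repr(exc)
    raise RuntimeError(f"all providers failed: {errors}")
```

         | The caller only sees one function; which provider actually
         | delivered the message is reported back for logging.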
       | orliesaurus wrote:
       | There have been a number of players in this area throughout the
       | years (Galileo [RIP], Runscope [semi-RIP], Newrelic just to
       | mention a few) for the analytical part ... and countless more for
       | the proxying part (Kong, Envoy, Tyk, etc)
       | Can you elaborate a little bit more where you place yourself in
       | the market? Why should someone trust you over any of the bigger,
       | older and more stable competitors? Thanks
         | cameroncooper wrote:
         | You're right that there are a number of proxy solutions out
         | there, but most are focused on exposing an API for external
         | consumption (i.e. API producers). We think that by focusing on
         | outbound API calls we can go deep on features that make less
         | sense in those products. The same is true for the analytics
         | solutions (e.g. Newrelic). For example it wouldn't make sense
         | for them to add automatic retry or request caching, but it's
         | still a common pain point with integrations and makes a lot of
         | sense for us to build. Finally, some of the tools (e.g.
         | Runscope) are meant for development debugging and don't solve
         | the production pain point.
           | thorgaardian wrote:
           | What you described in the first sentence is commonly referred
           | to as an API gateway - protecting ingress traffic into a
           | publicly accessible service/app (e.g. Kong, AWS API gateway,
           | Ambassador, etc). Lately there's been a lot more generalized
           | solutions in this category for inter-process communication
           | via service meshes like Istio, Gloo, AWS AppMesh, and others
           | - all of which seem to offer a solution that works for both
           | internal traffic routing as well as external (when
           | whitelisted).
           | Can you offer a description of your product that
           | differentiates it from service mesh solutions? Did you build
           | your own proxy software, or are you built on top of Envoy
           | like many of the other available solutions?
             | cameroncooper wrote:
             | We are not built on top of Envoy and have built our own
             | proxy.
             | Many of the service mesh solutions require you to deploy
             | and manage them as an on-premise installation. Our primary
             | offering is a hosted solution, but we also offer a managed
             | service for on-premise installations.
             | As you've correctly pointed out the service mesh solutions
             | can allow routing of external traffic, but by focusing on
             | the external calls there are features that make sense for
             | us to build that wouldn't make sense in something like
             | Istio/Gloo/AppMesh. For example, we can build an enhanced
             | experience around third-party APIs to better understand the
             | calls, errors, quotas, etc that are specific to that
             | provider.
               | candiddevmike wrote:
               | Why did you build your own proxy instead of using envoy?
               | What shortcomings did envoy have?
               | cameroncooper wrote:
               | We wanted to architect a system that made it easy to
               | deploy proxy nodes to multiple regions and clouds. We
               | also wanted it to be easy to add functionality specific
               | to our feature set. While we might have been able to
               | achieve our goals by modifying an existing proxy, it made
               | more sense to us to build our own. I have built proxies
               | in previous companies and this was something I was very
               | comfortable doing.
               | candiddevmike wrote:
               | Can you expand on what specific part of envoy prohibited
               | that?
               | Additionally, as other commenters mentioned, almost every
               | company has rallied around Envoy and is spending
               | considerable time/money making it better. If your
               | solution isn't as performant as envoy, it seems like a
               | poor architecture choice to roll your own, especially
               | given the time/money constraints startups have.
               | divbzero wrote:
               | Congratulations on the launch!
               | And thanks for your explanations on how your proxy is
               | similar to and different from API gateway or service mesh
               | solutions.
               | Having worked on both production monitoring and an API
               | gateway for a Fortune 100 company, I would consider
               | monitoring and proxy to each be valuable in its own right
               | and can envision scenarios where I'd want a standalone
               | product offering for one but not the other.
               | thorgaardian wrote:
               | That last paragraph is an interesting addition I hadn't
               | considered actually, so great answer! While I'd be
               | hesitant to use a 3rd party, hosted solution for this use
               | case, I can also see how that affords you the ability to
               | optimize fulfillment of requests per destination across
               | all your users. Is it safe to assume that long term
               | you'll offer this to larger customers via private
               | installation to alleviate security and latency concerns
               | while still benefiting from the destination knowledge of
               | the central hub to configure routing rules?
       | FanaHOVA wrote:
       | Started using apitracker a week or two ago; it's been great for
       | logging requests and inspecting failed/slow ones. Haven't tried
       | automated retrying yet, but excited to do that soon as well.
         | cameroncooper wrote:
         | Glad it's been able to help! Please let us know if there's
         | anything else we can do.
       | tonylucas wrote:
       | Have just signed up, was (yet again) looking for a solution like
       | this for monitoring outbound API calls. Look forward to trying it
       | dolftax wrote:
       | We've been using API Tracker in production for a few weeks now. The
       | primary use case for us is to reliably handle webhooks from
       | GitHub which our product relies heavily on (app installation,
       | commit and pull request events).
       | Unfortunately, GitHub doesn't retry any failed webhooks and when
       | our service goes down for a few seconds, thousands of webhooks
       | fail and pile up. GitHub doesn't provide an API to query the
       | failed webhooks and retry them either. We had to go through the
       | painstaking task of visiting GitHub's app dashboard and clicking
       | retry on each webhook, one by one.
       | With API tracker in place, we've updated our GitHub app's webhook
       | delivery URL to send the webhooks to API tracker and they forward
       | it to our services. In worst case when our service goes down for
       | a while, API tracker gracefully retries all the failed webhooks.
       | Ref: https://github.community/t5/GitHub-API-Development-
       | and/Handl...
         | thorgaardian wrote:
         | Interesting use-case for it. Without prior knowledge of a
         | solution like this I would have suggested you send the webhooks
         | to a queue backed notification system (e.g. SNS backed by SQS)
         | and subscribe to the event topic, but it sounds much easier to
         | configure and manage the way you instrumented it. Might be a
         | good use-case for me to try out!
           | capableweb wrote:
           | Yeah, this is what I've seen most services that rely on
           | webhooks from another service do. Add in some monitoring
           | of how many events are not yet processed (set an alarm when
           | there are X events in the queue) and you're done!
             | disposedtrolley wrote:
             | We're currently building a GitHub integration which
             | receives webhooks and kicks off a bunch of processing
             | actions based on the event type. Your suggestion sounds
             | like a great way to add some observability to the service
             | -- thanks!
           | cameroncooper wrote:
           | This is something you can easily configure with our automatic
           | retry function. We have an option to return a pre-configured
           | response to the caller, and put the request in a queue to be
           | retried until successful. This allows you to have a sustained
           | outage while making sure all calls are eventually delivered.
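           | A simplified sketch of that behaviour, for illustration only:
           | acknowledge the sender right away and keep failed deliveries
           | in a queue. QueueingForwarder and the deliver callable are
           | hypothetical names, the 202 response is an assumed
           | configuration, and an in-memory deque stands in for whatever
           | durable queue a real system would use.

```python
from collections import deque
from typing import Callable, Deque, Dict


class QueueingForwarder:
    def __init__(self, deliver: Callable[[Dict], bool], ack_status: int = 202):
        self.deliver = deliver        # hypothetical: forwards downstream, True on success
        self.ack_status = ack_status  # the pre-configured response for the sender
        self.pending: Deque[Dict] = deque()

    def receive(self, payload: Dict) -> int:
        """Always acknowledge; queue the payload if it cannot be delivered now."""
        # If a backlog already exists, queue behind it to preserve arrival order.
        if self.pending or not self._try(payload):
            self.pending.append(payload)
        return self.ack_status

    def retry_pending(self) -> int:
        """Redeliver in arrival order, stopping at the first failure."""
        delivered = 0
        while self.pending and self._try(self.pending[0]):
            self.pending.popleft()
            delivered += 1
        return delivered

    def _try(self, payload: Dict) -> bool:
        try:
            return bool(self.deliver(payload))
        except Exception:
            return False
```

           | Calling retry_pending on a timer gives the "retry until
           | successful" behaviour; nothing leaves the queue until the
           | downstream service has accepted it.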
             | ignoramous wrote:
             | > This allows you to have a sustained outage while making
             | sure...
             | Re-driving queue backlogs at services recovering from
             | sustained outages ends in tears almost always. Tread
             | carefully. :)
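             | One way to tread carefully, sketched under my own
             | assumptions: replay the backlog in small batches with a
             | pause between them, and stop as soon as the recovering
             | service starts failing again. redrive, batch_size and
             | pause_s are hypothetical names and values.

```python
import time
from typing import Callable, List


def redrive(
    backlog: List[dict],
    deliver: Callable[[dict], bool],
    batch_size: int = 50,
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Replay backlog in throttled batches; return the items still undelivered."""
    remaining: List[dict] = []
    for start in range(0, len(backlog), batch_size):
        batch = backlog[start:start + batch_size]
        failed = [item for item in batch if not deliver(item)]
        remaining.extend(failed)
        if failed:
            # Downstream is struggling again: keep the untouched tail for later.
            remaining.extend(backlog[start + batch_size:])
            break
        if start + batch_size < len(backlog):
            sleep(pause_s)  # give the recovering service room between batches
    return remaining
```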
         | bpicolo wrote:
         | > In worst case when our service goes down for a while
         | The worst case is still the same, no? API tracker goes down,
         | GitHub has no redelivery, same deal. More a matter of whose
         | uptime you trust more in this regard.
         | (That's not to say it's not valuable for this use case)
           | dolftax wrote:
           | Sure. The least we expect from any service sending webhooks
           | is a built-in retry strategy. GitHub doesn't have one. We were
           | thinking of building this ourselves internally but if someone
           | takes care of this for you reliably, why not.
           | For API tracker, even if their services go down for a short
           | while, it isn't good for business. Though it's been only a few
           | weeks using API tracker, we had zero failed webhook
           | deliveries. They say they've designed their systems with this
           | as a primary goal, of course. What if AWS or GCP goes down?
           | It's a matter of trust and SLAs.
         | ignoramous wrote:
         | Thanks.
         | At $349 for 1M calls, doesn't it get expensive? I'd reckon,
         | web-hooking it to Step Functions + AWS Lambda or SNS + SQS
         | would have been a much more cost-effective solution at the
         | cost of additional resources devoted to development and
         | maintenance, of
         | course. So, if you're comfortable sharing, what did the TCO
         | economics look like for you when you decided to use ApiTracker
         | instead?
       | ignoramous wrote:
       | Would it be right to say this is sentry.io meets envoy, grpc, and
       | konghq? Super interesting. Congratulations.
       | How do I manage my API integrations, you ask?
       | Global Accelerator (GLA) is a key infrastructure piece for a HA
       | service I'm building but for the data-plane. It is such a hassle-
       | free but slightly expensive way to vend anycast IPs (no need to
       | purchase ASNs and/or announce routes from colos across the globe)
       | and have the traffic load-balanced to 25+ AWS regions, that I
       | recommend it instantly to anyone architecting HA services.
       | https://fly.io and https://stackpath.com/edge-computing are
       | viable alternatives. Cloudflare announced MagicTransit which
       | isn't as smooth as AWS GLA in terms of developer experience,
       | whilst Azure and Google offer global-load-balancers, too, and may
       | be even before AWS announced it in 2018? So, really, I think
       | utilizing GLA is something folks should do if they run global HA
       | services. The only issue with using NLB behind AWS GLA is the
       | client-IP is not preserved. In our case, we needed it, so we had
       | to get creative with sticky routing and port assignment
       | (listeners) to do load-balancing / traffic-shaping.
       | Another HA trick I plan to employ is to use Cloudflare-Workers
       | (200+ PoPs) to front https-traffic to our control-plane
       | endpoints. It lacks observability, monitoring, and alerting
       | unless you're on Cloudflare's enterprise plans. The rate-limiting
       | option is expensive ($0.05 per 10k good requests). I'm sure
       | there's no way to queue requests out-of-the-box, so I can very
       | much see a need for what you've built, and where you guys fit in.
       | To be honest, I'd be surprised if firebase or API Gateway or
       | KongHQ don't already do what you do, as well. Is that the case? If
       | so, keep at it. It is a real need. And as you point out,
       | something that I've _had to_ build for every service and
       | integration point.
       | A few questions (I went through your website and docs, but here I
       | am):
       | - How do you handle secrets that the clients might need to share
       | with your service, like Apikeys or Access/SecretKeys?
       | - Do you also push logs to the customers in addition to them
       | pulling it from your endpoints / UI?
       | - A bit curious about your logging, monitoring, and alerting
       | infrastructure-- Is it run on top of CloudWatch or Prometheus or
       | Loggly or Elasticsearch or Lightstep or...?
       | - Do you support proxying http/REST APIs only?
       | https://autocode.stdlib.com/ which was discussed a few weeks ago
       | here looks, to me, like a good addition to what you're building.
         | cameroncooper wrote:
         | Thanks for sharing your experience. We love GLA as well.
         | Great questions.
         | - For sensitive fields that you do not want retained or
         | searchable, we can mask them out.
         | - We don't currently have integrations to push our logs to
         | another service, but this is a good use case for us and it's on
         | our near term roadmap.
         | - We use Elasticsearch in the product, but we also use
         | CloudWatch extensively for our own operations.
         | - Right now we only support proxying HTTP requests, but are
         | open to supporting other protocols.
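         | To make the masking answer concrete, here is a minimal sketch
         | of redacting sensitive headers before a call is written to a
         | log. It is an illustration rather than a description of how
         | the product does it; mask_headers and the default header list
         | are assumptions.

```python
from typing import Dict, Iterable

# Assumed defaults; a real deployment would make this list configurable.
DEFAULT_SENSITIVE = ("authorization", "x-api-key", "cookie")


def mask_headers(
    headers: Dict[str, str],
    sensitive: Iterable[str] = DEFAULT_SENSITIVE,
    mask: str = "***",
) -> Dict[str, str]:
    """Return a copy of headers that is safe to log; names match case-insensitively."""
    blocked = {name.lower() for name in sensitive}
    return {k: (mask if k.lower() in blocked else v) for k, v in headers.items()}
```

         | The original request is left untouched, so the proxied call
         | still carries the real credentials while the stored log entry
         | does not.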
       | thdxr wrote:
       | This is great, I can see the potential of something like this and
       | am jealous I'm not the one working on it!
       | Don't take the pushback in the other comments too seriously.
       | There is definitely an audience (myself included) who'd want a
       | focused, specific tool
       | sachinag wrote:
       | https://cloud.ibm.com/catalog/services/api-connect seems to do a
       | lot of this for free. Probably could also use the community
       | version of Mulesoft: https://developer.mulesoft.com/mulesoft-
       | products-and-licensi...
         | erik_landerholm wrote:
         | Two of the last companies I'd ever want to work with or rely
         | on, other than that...
       | derricgilling wrote:
       | We at Moesif
       | (https://www.moesif.com/solutions/track-third-party-api)
       | released a similar tool in 2017 and found that many of our
       | customers, including Deloitte, UPS, iFit, and Trung's previous
       | company, Snap Kitchen, were looking for a way to
       | track APIs without the complexity of a full service mesh like
       | Envoy. Especially if you're hosted in something that cannot run
       | an on-prem service mesh or gateway.
       | We're a little different in that we also support agent-based
       | rather than just proxy. Meaning we have an SDK that sits out-of-
       | band.
       (page generated 2020-02-18 23:00 UTC)