### About the Role
We are looking for a founding engineer to own the front-end
design and implementation of our products. We expect you to have top technical
ability, high autonomy, and the dynamism necessary to build out essential
components quickly.
There are two primary areas for which you will be responsible.
(1) Furthering and maintaining our public benchmark available at
<https://www.vals.ai>. The initial version of this site has garnered some
attention. There are significant ways we can improve the public reports,
making them the standard for evaluating LLM performance on enterprise tasks.
(2) Building out our evaluation platform, which companies use to
produce internal benchmarks. This is the interface subject matter experts
use to perform their review and then automate the evaluation for
subsequent runs.
It’s worth highlighting that in this role _you are a founding team member, not
an employee_. This means you have ownership in the company and product
direction. You don’t take orders, you engage in discussions. We are growing
quickly and you will help hire early engineers to support your work.
### Requirements
* 3+ years of industry experience building websites with React, with a significant portfolio of prior work.
* Experience working in teams. This includes working in development sprints, knowledge of best practices in working with Git, and reviewing pull requests.
* Strong communication skills. You can provide input to others and equally receive/integrate feedback.
* The tenacity to iterate and ship quickly.
* We are an in-person team, based in San Francisco. We will support your relocation or transportation as needed.
### Nice to haves
* Strong sense of design / aesthetic. By working with existing libraries and components, you can build out features rapidly without hand-holding from a designer.
* Experience writing endpoints with Django (Python). If components of the frontend are dependent on backend infrastructure, you can occasionally modify that code yourself.
* Familiarity with LLM methods and developments. Innate interest in the space will make it easier to build a valuable product.
* Experience with TypeScript.
### About Us
Measuring model ability is the most challenging part of creating applications
that are capable of automating any given part of the economy. There **are no
good techniques or benchmarks for evaluating LLM performance on
business-relevant tasks**, so adoption for enterprise production settings has been
limited (see Wittgenstein’s ruler).
This problem surfaces wherever LLMs have potential: AI tool companies need to
know whether the product they are building will satisfy customer demand,
enterprises need to determine how viable models and vendors are when making
purchasing decisions, and researchers need a north star toward which to expand
model ability.
Today, answering these questions amounts to hiring a human review team to
manually evaluate model outputs. This is prohibitively expensive and slow.
Vals AI is building the **enterprise benchmark** of LLMs and LLM apps on
real-world business tasks. In doing so we are creating the **infrastructure +
certification to automatically audit LLM applications**, verifying they are
ready for consumption.
See [our benchmarks](https://www.vals.ai/) and
[launch announcement](https://www.bloomberg.com/news/newsletters/2024-04-11/this-startup-is-trying-to-test-how-well-ai-models-actually-work?srnd=undefined) in
Bloomberg. We aim to build the barometer for whether AI is useful, and in
doing so, accelerate the automation of all knowledge work.
### What we are building:
Our core technology enables us to review + automatically audit LLM
applications in high value industries (legal, insurance, finance, healthcare).
With this and our own data, we maintain a public benchmark of the major LLMs
on enterprise tasks. Our success will be based on three components:
1. Our evaluation performs at human-level accuracy on the relevant axes for each industry/application.
2. Our platform has an intuitive interface that acts as a shared workspace for human reviewers and engineers.
3. We become the industry-standard benchmark, maintaining a loss-leading effort by publishing free reports and collaborating with credible data partners.
To achieve each of these, we are looking for machine learning engineers (Head
of AI, Members of Technical Staff) to develop novel evaluation techniques,
strong designers and front-end engineers (Founding Product Engineer) to
contribute to the platform, and a tenacious operator to write reports and
maintain our social media (email [rayan@vals.ai](mailto:rayan@vals.ai) if this
is of interest).
### What we offer:
* Highly competitive salary and meaningful ownership. Excellence is well rewarded.
* Relocation and transportation support.
* Health/vision/dental insurance coverage.
* Lunch and dinner provided, free snacks/coffee/drinks.
* Unlimited PTO.
### About us:
**Founding team:** The core methodology behind this platform comes from NLP
evaluation research we did at Stanford. We raised a $5M seed round from some of
the top institutional and angel investors in the valley. Our team has prior
work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we
have over 300 citations in our published work.
**Tech stack:** Our frontend is built in React with TSX. We use Django as our
back-end framework. All of the infra is on AWS.
### What we’re looking for:
* **Intelligence** is more important than a good-looking resume. Industry experience and pedigree are valuable only insofar as they are a proxy for talent itself.
* **Ownership** to create products. We don’t have the scale or time to actively “manage” every project or task. Working in a small, talent-dense team, we expect everyone to show initiative to build where it’s needed, not where it’s asked. We strive for autonomy over consensus.
* **Intensity**. The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier, enterprises are seeing massive pressure to adopt technology, startups are hungry to chase the white space. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
* **See solutions, not problems**. We’re not looking for people who pass hard problems on to others or admit defeat, but for people who see the opportunity to craft solutions at each juncture.
### Further Reading:
* [Hugging Face blog on evaluation](https://huggingface.co/blog/clefourrier/llm-evaluation?utm_source=ainews&utm_medium=email&utm_campaign=ainews-to-be-named-4285)
* [Anthropic’s blog on challenges in evaluation](https://www.anthropic.com/news/evaluating-ai-systems)
* [New York Times article on issues in benchmarking](https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?unlocked_article_code=1.kk0.2YY4.pu0LWd2Di99q&smid=nytcore-ios-share&referringSource=articleShare&ugrp=m)
* [Stanford HAI report showing hallucinations in legal tech tools](https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-or-more-benchmarking-queries)
### Referral Bonus
Know someone who would be a good fit? Connect them with
[rayan@vals.ai](mailto:rayan@vals.ai). If we hire them and they stay on for 90
days, you’ll get a $10,000 referral bonus and Vals AI merch!