A platform where facial recognition and biometric analysis are combined with artificial intelligence to transform well-being.
Mark Twain famously once said that “There are lies, damned lies, and statistics.” With the rise of Artificial Intelligence and Machine Learning, it's easy to forget that when an algorithm returns a prediction, well...that's just what it is: a prediction with a certain degree of confidence. The real danger is considering only the main result and entirely ignoring the other options.
At Okaya we take this to heart. Mental health is a very complex problem and cannot be seen as an absolute, black-and-white proposition.
When working on our algorithm, we are most interested in the outliers, the things that do not work, the data that do not fit the pattern. We feel this is the strongest way to be proactive about finding bias and understanding the limitations (and therefore the future efforts) for our algorithm.
Before reviewing any other element of the algorithm, we always start by looking at the quality of the data that feeds it. Data quality is critical to the overall quality and reliability of the algorithm.
This is why our primary source is our own internal research. This is the most reliable data set because we can decide on the mix we collect, who we collect it from, and when it is collected. We can also offer the participants full anonymity and privacy guarantees, which is often the tipping point when collecting clean data.
The next layer of research data we consider credible includes data submitted during research and IRB studies. We give this data (which we help design, collect, and manage) a lot of weight, but we know there are some outliers and external scenarios that will not always be covered in these samples.
Finally, there is always the option to include data that can be purchased from external vendors or collected from certain sites. We firmly believe in not using these resources because there are simply too many questions that the data brokers cannot precisely answer.
Computer vision has progressed tremendously in the past few years, and we are only in the infancy of what can be done in the field. To give you an idea of how precise the field is becoming, there are now ways to capture someone's heart rate using a camera similar to what you'd find in a Microsoft Kinect device. This is done simply by looking at subtle changes in skin color as blood flows through! Yet, just because the field is progressing does not mean it is perfect. For example, when the pandemic started, many algorithms could not deal with the fact that people were wearing masks.
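For readers curious what that heart-rate trick looks like in practice, here is a minimal sketch of the general remote-photoplethysmography idea (an illustration of the technique, not our production pipeline): average the green channel over a patch of facial skin, band-pass filter the signal to the human pulse range, and read off the dominant frequency. The face region, frame rate, and band limits below are assumed for the example.

```python
# Minimal sketch of remote photoplethysmography (rPPG): estimate heart rate from
# subtle frame-to-frame color changes in facial skin. The face region, frame rate,
# and band limits are illustrative assumptions, not production values.
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_heart_rate_bpm(frames, fps=30.0, roi=(100, 100, 200, 200)):
    """frames: sequence of HxWx3 RGB arrays; roi: (x, y, width, height) over the face."""
    x, y, w, h = roi
    # 1. Average the green channel over the skin region for every frame
    #    (green carries the strongest blood-volume signal).
    signal = np.array([f[y:y + h, x:x + w, 1].mean() for f in frames], dtype=float)
    signal -= signal.mean()

    # 2. Band-pass to the plausible heart-rate band (~0.7-4 Hz, i.e. 42-240 bpm).
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    filtered = filtfilt(b, a, signal)

    # 3. The dominant frequency of the filtered signal approximates the pulse.
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fps)
    spectrum = np.abs(np.fft.rfft(filtered))
    return freqs[np.argmax(spectrum)] * 60.0  # beats per minute
```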
We pay particular attention to making sure we "see" our subjects properly. One way we do so is by making sure we properly detect participants' blinks.
Blinks are very relevant for a few reasons:
From a pure science perspective, countless studies, especially in transportation, have shown the link between blinking rates and fatigue. As an aside, these studies are always done in clinical environments where the subject's face is properly lit, the person does not move much, and so on...in other words, environments that never happen in real life!
But blinks are by nature fast - really fast. A fraction of a second. So we know that if the algorithm properly calculates the number of blinks from a subject, then the rest of the facial landmarks we are considering are also accurate. To check this, we manually label our research samples with the number of blinks the subject has over a given period of time. We then compare it to what our algorithm calculates to gauge its accuracy.
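As an illustration of how a blink counter of this kind can work, here is a simplified sketch built on the widely used eye-aspect-ratio heuristic, not our exact implementation; the thresholds and frame counts are assumed values.

```python
# Simplified blink counting from eye landmarks using the common eye-aspect-ratio
# (EAR) heuristic. Thresholds and frame counts are illustrative assumptions;
# landmark extraction is assumed to happen upstream.
import numpy as np

def eye_aspect_ratio(eye):
    """eye: array of 6 (x, y) landmark points around one eye."""
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)

def count_blinks(ear_per_frame, closed_thresh=0.21, min_closed_frames=2):
    """Count a blink each time the EAR dips below the threshold for a few consecutive frames."""
    blinks, closed_run = 0, 0
    for ear in ear_per_frame:
        if ear < closed_thresh:
            closed_run += 1
        else:
            if closed_run >= min_closed_frames:
                blinks += 1
            closed_run = 0
    if closed_run >= min_closed_frames:  # count a blink that ends the clip
        blinks += 1
    return blinks
```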
We pay particular attention to samples where our algorithm is off by more than 2 blinks. What we've found so far is that these "failed scenarios" tend to share the same conditions: the subject's head moves a lot, and the camera, if it is not new enough, does not capture this movement well. As a result, the quality drops quite a bit because the algorithm can become confused between a true blink and a head movement that resembles a blink.
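The comparison itself is straightforward; a sketch of the kind of check described above might look like this (the record fields are hypothetical):

```python
# Flag samples whose predicted blink count is off by more than two from the
# manual label. The field names are hypothetical.
def flag_failed_samples(samples, tolerance=2):
    """samples: list of dicts with 'id', 'labeled_blinks', and 'predicted_blinks'."""
    return [
        s["id"]
        for s in samples
        if abs(s["predicted_blinks"] - s["labeled_blinks"]) > tolerance
    ]
```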
At the other end of the camera-quality spectrum, we also see an interesting use case when receiving data from high-quality devices: the subject does a blink within a blink. These are barely visible to the human eye in real time. Think of them as having somewhat of a "W" shape.
These types of blinks can really fool not just the computer vision system but also the entire algorithm.
The trick in these instances is to differentiate between a "W" blink and a regular double blink and account for each properly.
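One plausible heuristic, offered purely as an illustration and not as our actual rule, is to look at how far the eye reopens between the two dips: a partial reopen suggests a single "W" blink, while a full reopen suggests two distinct blinks.

```python
# Illustrative heuristic for "W" blinks: if the eye only partially reopens between
# two closure dips, merge them into one blink; a full reopen keeps them as two.
# The reopen threshold is an assumed value.
def count_blinks_with_w_merge(dips, reopen_thresh=0.27):
    """dips: list of (start_frame, end_frame, max_ear_before_next_dip) tuples."""
    blinks, i = 0, 0
    while i < len(dips):
        if i + 1 < len(dips) and dips[i][2] < reopen_thresh:
            blinks += 1  # partial reopen: the pair is one "W" blink
            i += 2
        else:
            blinks += 1  # full reopen (or last dip): a regular blink
            i += 1
    return blinks
```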
Distance is a factor when it comes to quality and accuracy. We've found that the best accuracy comes when the subject is between 1 and 5 feet from the camera. In this scenario, the results are accurate regardless of other elements (movement, etc.). Beyond that threshold, the quality drops quickly.
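A subject's distance can be approximated without extra hardware by using the pinhole-camera relation between the average adult inter-pupillary distance and its size in pixels. The sketch below is an illustration of that idea; the focal length and inter-pupillary constant are generic assumptions, not calibrated values from our devices.

```python
# Rough distance check using the pinhole-camera relation:
#   distance ≈ focal_length_px * real_size / size_in_pixels
# The focal length and average inter-pupillary distance are generic assumptions.
AVG_IPD_METERS = 0.063   # typical adult inter-pupillary distance
FEET_PER_METER = 3.281

def estimate_distance_feet(ipd_pixels, focal_length_px=900.0):
    return (focal_length_px * AVG_IPD_METERS / ipd_pixels) * FEET_PER_METER

def within_accuracy_range(ipd_pixels, lo_ft=1.0, hi_ft=5.0):
    """Check whether the subject sits in the 1-5 foot band where accuracy holds up."""
    return lo_ft <= estimate_distance_feet(ipd_pixels) <= hi_ft
```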
Lighting plays a more limited role than we expected in our original assumptions. Obviously, if a video is shot entirely in the dark, it's not going to work! But computer vision generally performs quite well across different ranges of exposure, and this is even without modifying the video intake at all.
What a person says and how they say it is important of course. But before including this information in the algorithm we also check for a few things.
Sometimes people stay silent even when they are supposed to say something. This, in and of itself, is a telltale sign. So step one is making sure we identify whether the person is really speaking. We are accurate 97% of the time. We do this by using labeled data and comparing our algorithm's prediction with the label. This accuracy includes scenarios where there is a TV playing in the background or a second person speaking. The other 3% can be attributed to what we call the ventriloquist effect - if someone can masterfully "lip-sync," it is very difficult for an algorithm to detect the fake.
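To make the "is the person really speaking?" step concrete, here is a deliberately naive voice-activity sketch, a stand-in for illustration rather than the detector we actually use; a simple energy check like this cannot, on its own, separate the subject's voice from a TV or a second speaker.

```python
# Deliberately naive voice-activity check: mark a check-in as "spoken" when enough
# short frames exceed an energy floor. The floor and ratio are assumed values; a
# real detector also has to separate the subject from background voices.
import numpy as np

def is_subject_speaking(audio, sample_rate=16000, frame_ms=30,
                        energy_floor=0.01, min_voiced_ratio=0.1):
    """audio: 1-D float array of samples in [-1, 1]."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return False
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))   # per-frame energy
    voiced_ratio = float((rms > energy_floor).mean())
    return voiced_ratio >= min_voiced_ratio
```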
Now that we know they are speaking, are the subjects understandable? This question is quite tricky. To understand the scope of the challenge, take a language like English. Start with native speakers: someone from the United States sounds distinctly different from someone from Wales or Australia. And once you add non-native speakers to the mix, the variations are even more diverse, and certain accents will throw you off (Scottish, for example!). Voice AI is no different. Will it eventually be close to perfect? Most likely, but for now it is still unreliable at times.
In our analysis we see a drop in comprehension quality of about 10% between a native and a non-native speaker. There is also some variation between heavy native accents and more neutral native accents. Gauging "understandability" is very important because it is at the root of sentiment analysis, or any computation really. Additionally, algorithms have a tendency to default to classifying someone as "neutral" when they do not understand the person, thus creating bias and inaccurate results.
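One way to guard against that "default to neutral" bias, shown here only as an illustration and not as our implementation, is to gate sentiment analysis on transcription confidence and surface an explicit "not understood" outcome instead of a neutral score. The function names and confidence threshold are hypothetical.

```python
# Gate sentiment on transcription confidence: return an explicit "not understood"
# outcome instead of silently defaulting to "neutral". Names and the threshold
# are hypothetical, for illustration only.
def gated_sentiment(transcript, transcription_confidence, classify_sentiment,
                    min_confidence=0.8):
    """classify_sentiment: any callable mapping text -> a sentiment label."""
    if transcription_confidence < min_confidence:
        return "not_understood"   # surface the uncertainty rather than masking it
    return classify_sentiment(transcript)
```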
One known limitation of our algorithm at this time is that we do not do any facial or voice identification of the subject. This means that if we're expecting a check-in from Pat but Jill does it instead, we will not pick that up and we'll be looking at Jill's results.
Can this be a problem? Yes and no. It really depends on the use case you are trying to address. Given the scenarios our technology is used for, the trade-off between privacy and accuracy would not benefit our customers and would also open the door to personal-monitoring options that many users do not want to face.
We humans are nuanced, ever-changing organisms, and these constant variations are expressed in our health. Yet, when an assessment is done, it reflects a monochrome snapshot of where we are and often discounts any variation.
More problematic is that algorithms often only return a very black-and-white result as to whether someone falls under a given condition. This is not just a computational issue but a classification issue that has been going on in health care for a long time. For example, in 1917, the U.S. Public Health Service printed a list of over 60 health conditions – from anemia to varicose veins – that doctors could spot during the brief line inspection. How brief was the line? 6 seconds! That was the amount of time doctors spent with people as they were being processed by immigration at Ellis Island.
In our algorithm, we aim to give a more nuanced view of where the subject is. Specifically, this means always considering the implications of loss vs. accuracy, as well as false positives vs. false negatives, in our algorithm, along with the proper identification of outliers and markers.
Because we deal with mental health and the extreme consequences associated with these struggles, we voluntarily err on the more conservative side when assessing someone's condition.
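In practice, "erring on the conservative side" can be pictured as choosing a decision threshold that keeps false negatives very rare, even at the cost of more false positives. The sketch below illustrates that idea; the target false-negative rate is an arbitrary example, not our operating point.

```python
# Illustrative threshold selection that favors sensitivity: among candidate
# thresholds, keep those whose false-negative rate stays under a strict ceiling,
# then pick the one with the fewest false positives. The ceiling is an example value.
import numpy as np

def conservative_threshold(scores, labels, max_false_negative_rate=0.02):
    """scores: model risk scores; labels: 1 = condition present, 0 = absent."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_fp = None, np.inf
    for t in np.unique(scores):
        preds = scores >= t
        fn = np.sum(~preds & (labels == 1))
        fp = np.sum(preds & (labels == 0))
        fnr = fn / max(labels.sum(), 1)
        if fnr <= max_false_negative_rate and fp < best_fp:
            best_t, best_fp = t, fp
    return best_t  # None means no threshold met the false-negative ceiling
```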
We encourage you to bookmark this page as it constantly evolves based on our latest research and findings.