How to Spend a Few Nice Nights with Optical Character Recognition (OCR)
I recently worked on an interesting problem: recognizing text from a photo of a book and making it selectable in a web application. As a book reader, I found this problem interesting because it would make taking notes from books much easier, allowing me to read more and have better notes. In this article, I will go through three short stories about solving it.
For reference, here is the entire process:
Also, here is how it looks right now:
How to recognize text in a photo
It was a new problem for me. Fortunately, I was not naive, and the first thing that came to mind was to use an existing solution rather than implement it from scratch. The first thing that jumped out at me on Google was tesseract.js. Before delving deep into the library, I couldn’t resist my natural engineering instincts and started writing code to play with it. Within a few hours during one March evening, I had a working prototype. It performed decently, but the library couldn’t recognize all the text; some words ended up looking more like something an alien would say rather than English words.
Back to Googling (does anyone still use this word?). ChatGPT wasn’t very helpful in this case, and the prompt “write a program that would recognize text from a photo and make it selectable” was not helpful at all! The code didn’t work, even though everyone was saying that programmers would be obsolete in a few months!
Jokes aside, I then discovered Google's OCR API. I rewrote my proof of concept, or more precisely, I called an API that returned perfect results with all the text recognized. Maybe ChatGPT will replace all programmers (API/library callers) at the end of the day.
How to not use the result from OCR
Equipped with this massive success from one API call, I was ready to solve the next problem: render the text returned by OCR on top of the photo of the book's page, and make it selectable. The result from OCR looks something like this (simplified):
{
  data: [
    {
      dimensions: [x0, y0, x1, y1],
      text: "the"
    },
    {
      dimensions: [x0, y0, x1, y1],
      text: "first"
    }
  ]
}
You can see that the whole page is broken into words, each with the coordinates of its bounding box. My idea was to use those coordinates to render each word as transparent text on top of the image, so users could select it the same way you select text on a page such as this one. Something like this:
// Renders one OCR word as an invisible, absolutely positioned box.
const Word = ({ dimensions, text }) => {
  const [x0, y0, x1, y1] = dimensions;
  const style = {
    left: `${x0}px`,
    top: `${y0}px`,
    width: `${x1 - x0}px`,
    height: `${y1 - y0}px`,
  };
  return (
    <div style={style} className="absolute text-transparent">
      {text}
    </div>
  );
};

// `data` is the array of words returned by the OCR call.
const App = ({ data }) => {
  return (
    <div className="relative w-full h-screen">
      {data.map((item, index) => (
        <Word key={index} dimensions={item.dimensions} text={item.text} />
      ))}
      <img src="page.png" alt="Book page" />
    </div>
  );
};
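One detail worth spelling out: the snippet uses the OCR coordinates directly as CSS pixels, which only lines up if the image is displayed at the same size as the photo that was sent to OCR. Here is a minimal sketch of rescaling a box, assuming the coordinates are in pixels of the original photo. The name `scaleBox` and its parameters are hypothetical and not part of the original code.

```javascript
// Hypothetical helper: rescale one OCR box from the photo's pixel space
// to the size at which the image is actually rendered.
// Assumption: dimensions are [x0, y0, x1, y1] in pixels of the original photo.
function scaleBox(dimensions, natural, rendered) {
  const [x0, y0, x1, y1] = dimensions;
  const sx = rendered.width / natural.width;
  const sy = rendered.height / natural.height;
  return {
    left: x0 * sx,
    top: y0 * sy,
    width: (x1 - x0) * sx,
    height: (y1 - y0) * sy,
  };
}

// Example: a 2000x3000 photo shown at 500x750 (scale factor 0.25).
const box = scaleBox(
  [400, 800, 600, 880],
  { width: 2000, height: 3000 },
  { width: 500, height: 750 }
);
console.log(box); // { left: 100, top: 200, width: 50, height: 20 }
```

In a browser, the natural size could come from the image element's `naturalWidth` and `naturalHeight`, and the rendered size from its `getBoundingClientRect()`.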
The overlay worked, but it was as useful as sunglasses during a Scottish summer. You could carry it in your pocket, but you wouldn't use it much. You couldn't comfortably select the text. There needed to be a better solution, and indeed, there was!
How to use the result from OCR
I tried a few different ideas, but nothing was quite good enough. After some time experimenting, during one beautiful Saturday morning, I realized something interesting: the mouseover event! The idea was simple: if you hover over a word while dragging one end of the selection, it means you want the selection to reach that particular word. I quickly threw together a prototype, and it worked amazingly well! It looked something like this (pseudo-code):
// Assumes component state for startPosition, endPosition, isStartMoving and
// isEndMoving (for example via useState), plus an onWordClick handler.
const handleOnMouseOver = (index) => {
  // Only react while the user is dragging one end of the selection.
  if (!isStartMoving && !isEndMoving) {
    return;
  }
  if (isStartMoving) {
    // The start of the selection must stay before its end.
    if (index < endPosition) {
      setStartPosition(index);
    }
  } else if (isEndMoving) {
    setEndPosition(index);
  }
};

return (
  <>
    {processedWords.map((word, index) => (
      <div
        key={`${word.text}-${index}`}
        className="absolute"
        style={{
          top: word.uploadedY0,
          left: word.uploadedX0,
          width: word.uploadedX1 - word.uploadedX0,
          height: word.uploadedY1 - word.uploadedY0,
        }}
        onMouseOver={() => handleOnMouseOver(index)}
        onClick={() => onWordClick(index)}
        data-word={word.text}
        data-index={index}
      />
    ))}
  </>
);
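With a start and an end index in hand, the selected text is just the words in between. Here is a minimal sketch, assuming each word has a `text` field and that both indices are inclusive. The name `selectedText` is hypothetical, and the real application may assemble the text differently.

```javascript
// Hypothetical helper: join the words between two indices into one string.
// Assumptions: each word has a `text` field, and both indices are inclusive.
function selectedText(words, startPosition, endPosition) {
  // Tolerate a reversed range in case the two ends cross each other.
  const from = Math.min(startPosition, endPosition);
  const to = Math.max(startPosition, endPosition);
  return words
    .slice(from, to + 1)
    .map((word) => word.text)
    .join(' ');
}

const words = [{ text: 'the' }, { text: 'first' }, { text: 'page' }];
console.log(selectedText(words, 0, 1)); // "the first"
```

Joining with a single space is a simplification: it drops the original line breaks and does not rejoin words that were hyphenated across lines.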
I played with the prototype, and then I grabbed my phone and immediately started to feel sad. On mobile, the mouseover event does not work! Luckily enough, you can use the touchmove event instead. You can easily 'hack it' to make it behave similarly to the mouseover event. See the following example:
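const handleTouchMove = (event) => {
  const touch = event.touches[0];
  // Find the element currently under the finger.
  const element = document.elementFromPoint(touch.clientX, touch.clientY);
  const rawIndex = element ? element.getAttribute('data-index') : null;
  // Ignore touches that are not over a word; Number(null) would otherwise be 0.
  if (rawIndex === null) {
    return;
  }
  const index = Number(rawIndex);
  if (isStartMoving) {
    setStartPosition(index);
  } else if (isEndMoving) {
    setEndPosition(index);
  }
};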
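As an alternative to asking the DOM which element is under the finger, the same lookup can be done from the word rectangles themselves. This is a different technique from the `elementFromPoint` approach above, shown only as a sketch. It assumes the words carry the `uploadedX0`/`uploadedY0`/`uploadedX1`/`uploadedY1` fields from the earlier snippet and that the point is expressed in the same coordinate space. The name `wordIndexAt` is hypothetical.

```javascript
// Hypothetical helper: index of the word whose box contains (x, y), or -1.
// Assumption: (x, y) and the uploaded* fields share one coordinate space,
// for example pixels relative to the image container.
function wordIndexAt(words, x, y) {
  return words.findIndex(
    (w) =>
      x >= w.uploadedX0 &&
      x <= w.uploadedX1 &&
      y >= w.uploadedY0 &&
      y <= w.uploadedY1
  );
}

const pageWords = [
  { text: 'the', uploadedX0: 10, uploadedY0: 10, uploadedX1: 40, uploadedY1: 25 },
  { text: 'first', uploadedX0: 45, uploadedY0: 10, uploadedX1: 90, uploadedY1: 25 },
];
console.log(wordIndexAt(pageWords, 50, 15)); // 1
console.log(wordIndexAt(pageWords, 200, 200)); // -1
```

Note that `touch.clientX` and `touch.clientY` are relative to the viewport, so you would first subtract the container's `getBoundingClientRect()` left and top before calling the helper.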
Done! The problem is solved, and it works! You can try it yourself if you are interested: ravenapp.ai.
Conclusion
It was fun to work on this atypical problem. It took me a few nights and one Saturday, but the feeling of solving it was great. Best of all, I can use the application to take notes from books!