Text OCR

AI Data Capture SDK

Overview

The TextOCR class leverages Optical Character Recognition (OCR) to detect, recognize, and group text from images. The Text OCR process consists of three stages:

  1. Text Detection - Identifies and filters text boxes within the image.
  2. Text Recognition - Reads and extracts text content from each identified text box.
  3. Text Grouping - Organizes recognized words into lines or paragraphs.

This guide begins with initial instructions for basic setup of OCR. Subsequent sections offer a comprehensive list of OCR settings that can be fine-tuned for specific needs.

For non-standard use cases, particularly for Grouper Settings, developers are encouraged to experiment with and adjust these parameters to optimize performance.

Additionally, the process() method of the TextOCR class can be utilized to detect and recognize text within images. This interface enables developers to build CameraX analyzers that integrate with other detectors.


AI Model

The text-ocr-recognizer model is designed for text recognition tasks. For more information and to download the model, refer to Text OCR Model.


Capabilities

Supported Characters

TextOCR recognizes a range of characters, including:

    0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

By default, it supports a maximum word length of approximately 15 characters, though this limit may decrease with the use of uncommon fonts. Enabling tiling removes this restriction.

Input/Output

Input Parameters: The default model input size is 640x640 pixels, but this can be adjusted during runtime initialization.

Output Parameters: The output consists of a list of text detections, each accompanied by a list of complex bounding boxes that define the location and content of the detected text.


Configuration

Before starting with TextOCR, configure key settings such as model input size, resolution, and inference type. Changes to these settings require reinitializing the models. Information on configuring these settings is provided in the sections that follow.

Model Input Size

The Model Input Size defines the resolution at which the AI processes images. Before analysis, images are resized to this dimension. Adjusting this size balances speed and accuracy.

Key Considerations:

  • Start with the default resolution 640x640 for optimal processing.
  • If the results are not sufficiently accurate for small text, increasing the resolution can improve precision.

| Input Size | Best For | Use Case | Consideration |
|---|---|---|---|
| Smaller (e.g., 640x640) | Speed: Faster processing | Large or close text | Reduced accuracy for small text |
| Larger (e.g., 1600x1600) | Accuracy: Better for details | Fine print, distant, or dense text | Slower processing, higher memory usage |
| Custom (multiples of 32, e.g., 800x800) | Balancing speed and accuracy | Low-contrast or medium-sized text | Requires experimentation to find the optimal setting |
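
Model input dimensions must be multiples of 32. As a convenience, a desired dimension can be snapped to the nearest valid value; the helper below is a hypothetical sketch (roundToMultipleOf32 is not part of the SDK):

    // Hypothetical helper: snap a desired input dimension to the nearest multiple of 32
    static int roundToMultipleOf32(int desired) {
        return Math.max(32, Math.round(desired / 32f) * 32);
    }

    // Example: roundToMultipleOf32(800) -> 800; roundToMultipleOf32(850) -> 864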


Resolution

Camera resolution refers to the number of pixels the device’s camera sensor can capture (e.g., 1MP = 1280x720, 8MP = 3840x2160). It determines the quality of the source image before any resizing occurs for AI processing.

Higher camera resolution provides a more detailed and higher-quality original image, which can significantly enhance the AI model's ability to detect and recognize small, faint, or distant text. However, this increased detail comes at the cost of greater processing power and memory usage.

Key Considerations:

  • Impact of Camera Resolution - Higher resolutions enhance input image detail, aiding in recognizing small, low-contrast, or distant text. However, images are downscaled to the model's input size for processing, so the benefits of high-resolution cameras diminish with low model input sizes.
  • General Guidance - Aim for a minimum text height of 16 pixels in the input image, adjusting for font size and camera distance from the target (a worked example follows the table below).

| Resolution | Best For | Use Case | Consideration |
|---|---|---|---|
| 1MP (1280x720) | Speed, power efficiency | Large/simple text | May miss fine details |
| 2MP (1920x1080) | General use | Stylized or moderately detailed text | Balanced performance |
| 4MP (2688x1512) | Detailed scans | Contracts, forms, dense text | Higher memory and battery use |
| 8MP (3840x2160) | Maximum detail | Archival purposes | Large files, diminishing returns for low input sizes |
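
To see how camera resolution and model input size interact with the 16-pixel guideline, the sketch below works through the arithmetic under the simplifying assumption of a plain proportional downscale:

    // Rough feasibility check (assumes a simple proportional downscale to the model input)
    int frameHeight = 2160;      // 8MP capture (3840x2160)
    int modelInputHeight = 640;  // model input size
    int textHeightPx = 40;       // text height measured in the camera frame

    double effectiveHeight = textHeightPx * ((double) modelInputHeight / frameHeight);
    System.out.printf("Effective text height after resize: %.1f px%n", effectiveHeight);
    // Prints ~11.9 px - below the ~16 px guideline, so raise the model
    // input size or move the camera closer to the text.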


Relationship Between Model Input Size and Camera Resolution:

| Resolution | Low Input Size (e.g., 640x640) | High Input Size (e.g., 1600x1600) |
|---|---|---|
| Low Resolution (e.g., 1MP / 1280x720) | Speed: Fastest. Accuracy: Lowest. Use Case: Large, clear text close to the camera. Note: Small/fine details may be lost. | Not Recommended: Wasted computation with little accuracy gain. |
| High Resolution (e.g., 8MP / 3840x2160) | Speed: Fast. Accuracy: Moderate. Use Case: Large/medium text, quick scans. Note: High-resolution source is downsampled for AI, so small text may still be missed. | Speed: Slowest. Accuracy: Highest. Use Case: Detailed, small, dense, or distant text. Note: High memory and battery usage; may stress low-end devices. |


Inferencer Type (Processor)

The Inferencer Type specifies which chip on the device is responsible for performing AI computations (referred to as "inference"). This choice directly impacts the speed and efficiency of image processing.

Key Considerations:

  • DSP (Digital Signal Processor) - Use DSP if available, as it is specifically designed for real-time, energy-efficient AI tasks and provides optimal performance.
  • GPU (Graphics Processing Unit) - If DSP is not available, the GPU serves as an alternative for handling AI workloads efficiently.

| Processor | Description | Performance | Use Case | Device Platform |
|---|---|---|---|---|
| DSP (Digital Signal Processor) | Optimized for real-time, energy-efficient tasks. Ideal for specific AI workloads where battery life and efficiency are critical. | Best: Fastest and most efficient. Preserves battery life during continuous use. | Best choice for real-time, low-energy tasks such as edge AI inference. | Best for: SD6490, SD5430 FP2; for relevant device models, visit Zebra Platform Devices |
| GPU (Graphics Processing Unit) | Designed for heavy, parallel AI tasks and complex models. Suitable for computationally intensive workloads. | Good: High-speed performance but consumes more power than DSP. Handles large-scale parallel tasks effectively. | Best for handling complex AI models or tasks requiring significant computational power. | Best for: SD4490, SD5430 FP1. Good for: SD6490, SD5430 FP2; for relevant device models, visit Zebra Platform Devices |
| CPU (Central Processing Unit) | Acts as the fallback processor for AI inference tasks. Always available but less efficient compared to DSP and GPU. | Fallback: Slower and less efficient but always available. Consumes more power, making it less suitable for continuous tasks. | Suitable for lightweight tasks or as a fallback when DSP or GPU are unavailable or causing issues. | Fallback for: SD6490, SD4490, SD5430 FP2, SD5430 FP1; for relevant device models, visit Zebra Platform Devices |
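
Since runtimeProcessorOrder (used in the samples below) takes an array, a priority chain across processors can plausibly be expressed as in this sketch. InferencerOptions.GPU and InferencerOptions.CPU are assumed constants modeled on the InferencerOptions.DSP constant shown in the samples, so verify them against the SDK:

    // Hedged sketch: request DSP first, then fall back to GPU, then CPU.
    // Assumes a TextOCR.Settings object named `settings` (see the Developer Guide below).
    Integer[] rpo = new Integer[]{
            InferencerOptions.DSP,  // preferred: fastest and most power-efficient
            InferencerOptions.GPU,  // fallback where no usable DSP is present
            InferencerOptions.CPU   // always available, least efficient
    };
    settings.detectionInferencerOptions.runtimeProcessorOrder = rpo;
    settings.recognitionInferencerOptions.runtimeProcessorOrder = rpo;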

Developer Guide

This guide outlines the process for using TextOCR to detect and recognize text within images, from initialization to outputting the identified text.

Step 1: Initialization

Follow these steps to set up and initialize a TextOCR object:

  1. Import the TextOCR class: Use com.zebra.ai.vision.detector.TextOCR.

  2. Initialize the SDK: Use your application's context object and invoke init() from the AIVisionSDK class.

  3. Configure OCR Settings: Create a TextOCR.Settings object.

  4. Optional: Set model input dimensions: If needed, customize the model input dimensions (height and width). These should be multiples of 32 (e.g., 640). For guidance, see Model Input Size.

    settings.detectionInferencerOptions.defaultDims.width = [your value];
    settings.detectionInferencerOptions.defaultDims.height = [your value];
    
    • Smaller Input Sizes - Reduce processing time and increase speed, but may decrease accuracy. Ideal for larger or closer text.
    • Larger Input Sizes - Improve accuracy for smaller or more distant text, but increase inference time. An input size that is too large may cause out-of-memory errors and potentially cause an application crash at run-time.
  5. Optional: Configure additional OCR settings to optimize detection and recognition, as described in the Text Detection, Text Recognition, and Text Grouping sections later in this guide.

  6. Initialize the OCR object - Declare a TextOCR object. Use CompletableFuture to initialize it asynchronously with an Executor for concurrent processing.

  7. Callback Handling - Use thenAccept() to assign the initialized TextOCR object to the textocr variable, enabling it for text detection and recognition tasks.

Sample Code

Initialization sample code:

    import com.zebra.ai.vision.detector.TextOCR;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    // Initialize the SDK
    AIVisionSDK.getInstance(context).init(); // context refers to the application context object.

    // Initialize the TextOCR settings object
    String mavenModelName = "text-ocr-recognizer";
    TextOCR.Settings settings = new TextOCR.Settings(mavenModelName);

    // Optional: Override the default model input size
    settings.detectionInferencerOptions.defaultDims.width = 1280;
    settings.detectionInferencerOptions.defaultDims.height = 1280;

    // Optional: set the runtime processing order; by default DSP is used
    Integer[] rpo = new Integer[]{InferencerOptions.DSP};
    settings.detectionInferencerOptions.runtimeProcessorOrder = rpo;
    settings.recognitionInferencerOptions.runtimeProcessorOrder = rpo;

    // Declare the TextOCR handle (a class member field, so the callback below can assign it)
    TextOCR textocr = null;

    // Initialize an executor thread for returning results
    Executor executor = Executors.newFixedThreadPool(1);

    // Initialize textocr asynchronously
    // settings = TextOCR.Settings object created above
    CompletableFuture<TextOCR> futureObject = TextOCR.getTextOCR(settings, executor);

    // Use the futureObject to implement the thenAccept() callback of CompletableFuture
    futureObject.thenAccept(OCRInstance -> {
        // Use the TextOCR object returned here for text detection and recognition
        textocr = OCRInstance;
    }).exceptionally(e -> {
        if (e instanceof AIVisionSDKException) {
            Log.e(TAG, "[AIVisionSDKException] TextOCR object creation failed: " + e.getMessage());
        }
        return null;
    });

Step 2: Capture Image

Capture the image and ensure it is in the form of a Bitmap. For CameraX-based applications, developers may build their own custom ImageAnalyzers to feed a sequence of frames to the TextOCR interface. For more information, refer to CameraX.

Step 3: Recognize Text

There are two methods to recognize text within an image:

  • process() API Method: Suitable for applications requiring both text localization and recognition in a single operation. This method is particularly well-suited for integration with frameworks like CameraX to enable a streamlined workflow where image analysis and text detection occur simultaneously. Typical Use Cases:
    • Integration with CameraX - Suited to applications that utilize CameraX for image analysis. The process() method can serve as a detector for CameraX analyzers, enabling real-time text detection directly from camera feeds.
    • ImageData Objects - Accepts ImageData objects from various sources, including CameraX, Camera2 APIs, or local storage, offering flexibility in handling input images.
    • Organized Text Output - In addition to detecting text, the process() method organizes the recognized text into paragraphs, lines, and words, returning detailed ParagraphEntity objects.
    • Localization and Recognition - Ideal for scenarios where detecting and recognizing text paragraphs in one step is required, simplifying the process and improving efficiency.
  • detect() API Method: Suitable for applications that require straightforward text detection without detailed structural information, or for those working directly with bitmap images. This method offers a simpler interface for retrieving processed results asynchronously. Typical Use Cases:
    • Bitmap Images - For applications that primarily handle bitmap images, the detect() method enables direct input of bitmap data for text detection.
    • Basic Detection Requirements - Suitable for scenarios where generic text, words, or paragraphs need to be detected without additional structural details.
    • Asynchronous Processing - Supports asynchronous detection operations using executors, making it well-suited for applications that perform background processing.

Choose one of these methods to recognize text within an image.

> Method 1: Using Process() API

The process() method in the TextOCR class enables applications to pass an ImageData object and perform both text localization and recognition in a single operation, based on the provided settings. This interface is designed to function as a "detector" for CameraX analyzers and can be used alongside other detectors, such as the BarcodeDecoder.

Note: Applications can use the process() API even if they are not implementing the CameraX ImageAnalyzer interface. ImageData objects from other sources, such as Camera2 APIs or local storage, can also be passed to the process() API. In such cases, skip steps 1 and 2 below.

Steps to Use the process() Method:

  1. Implement ImageAnalysis.Analyzer - Develop a custom CameraX analyzer by implementing the ImageAnalysis.Analyzer interface.
  2. Override analyze() - CameraX continuously feeds frames to the analyzers that are bound to it. Override the analyze() method to define the specific functionalities required for your application.
  3. Prepare Inputs - The process() method requires an ImageData object. Use the helper methods provided by ImageData to convert source image types (e.g., ImageProxy, android.media.Image, or Bitmap) into the required format.
  4. Localize and Decode Paragraphs - Use the process() method to detect and decode paragraphs. The method outputs a ParagraphEntity object.
  5. Handle Results - Once the CompletableFuture completes, process the decoded paragraph. From the ParagraphEntity object, extract LineEntity lines using the getLineEntities() method. Similarly, extract words from the LineEntity object using the getWordEntities() method.
  6. Dispose of the Decoder - After decoding is complete and the TextOCR instance is no longer needed, dispose of the instance to release resources.

Sample Code:

    List<ParagraphEntity> resultList = textocr.process(ImageData.fromImageProxy(image)).get();

    // Iterate over the list of paragraph entities
    for (ParagraphEntity entity : resultList) {

        // Access detection confidence
        float confidence = entity.getAccuracy();

        // Access bounding box
        Rect boundingBox = entity.getBoundingBox();

        // Access lines from the paragraph entity
        Line[] lines = entity.getTextParagraph().lines;

        // Iterate over the list of lines
        for (Line line : lines) {
            // Access words from the line
            for (Word word : line.words) {
                // Access the bounding box of the word
                ComplexBBox bbox = word.bbox;

                // Access the text decodes of the word
                DecodedText[] decodedTexts = word.decodes;

                // Get the decoded text with the highest accuracy at the first index
                String decodedValue = word.decodes[0].content;
            }
        }
    }
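
For reference, a minimal CameraX analyzer wrapping process() might look like the sketch below. It uses the two-argument process(imageData, executor) overload documented in the Methods section; the OcrAnalyzer class name and result handling are illustrative only:

    import androidx.camera.core.ImageAnalysis;
    import androidx.camera.core.ImageProxy;
    import java.util.concurrent.Executor;

    // Illustrative analyzer: forwards each frame to TextOCR.process()
    class OcrAnalyzer implements ImageAnalysis.Analyzer {
        private final TextOCR textocr;
        private final Executor executor;

        OcrAnalyzer(TextOCR textocr, Executor executor) {
            this.textocr = textocr;
            this.executor = executor;
        }

        @Override
        public void analyze(ImageProxy imageProxy) {
            textocr.process(ImageData.fromImageProxy(imageProxy), executor)
                    .thenAccept(paragraphs -> {
                        // Handle the List<ParagraphEntity> results here
                    })
                    .whenComplete((result, throwable) -> {
                        // Always close the frame so CameraX can deliver the next one
                        imageProxy.close();
                    });
        }
    }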

> Method 2: Using detect() API

The detect() API allows a bitmap image to be passed in and the processed results to be retrieved asynchronously as ComplexBoundingBox objects. These can then be parsed in the desired format: as generic text, words, or paragraphs.

  • Generic Text - Outputs text in complex bounding boxes. Sample code:

    Bitmap image = ... // Your bitmap image here
    
    // Initialize executor 
    Executor executor = Executors.newFixedThreadPool(1); 
    
    // Input parameters include a bitmap image and an executor thread object for performing detections 
    CompletableFuture<OCRResult[]> futureResult = textocr.detect(image, executor); 
    
    futureResult.thenAccept(ocrResults -> { 
        // Process the returned output that contains complex bounding boxes and text within 
    }).exceptionally(e -> { 
        if (e instanceof AIVisionSDKException) { 
            Log.e(TAG, "[AIVisionSDKException] Error in text detection: " + e.getMessage()); 
        } 
        return null; 
    }); 
    
    // Once finished with the textocr object, dispose of it to release resources and memory used during detection. 
    textocr.dispose();
  • Words – Outputs an array of words. A word is a discrete unit of text identified within an image, typically separated by spaces or punctuation. Sample code:

    Bitmap image = ... // Your bitmap image here
    
    // Initialize executor 
    Executor executor = Executors.newFixedThreadPool(1); 
    
    // Input parameters include a bitmap image and an executor thread object for performing detections 
    CompletableFuture<Word[]> futureWords = textocr.detectWords(image, executor); 
    
    futureWords.thenAccept(words -> { 
        // Process the returned array of detected words 
    }).exceptionally(e -> { 
        if (e instanceof AIVisionSDKException) { 
            Log.e(TAG, "[AIVisionSDKException] Error in text detection: " + e.getMessage()); 
        } 
        return null; 
    }); 
    
    // Once finished with the textocr object, dispose of it to release resources and memory used during detection. 
    textocr.dispose();
  • Paragraphs - Outputs a hierarchical structure of paragraphs using the grouping mechanism described in Grouper Settings. A paragraph is formed by grouping words that appear on the same line, and these lines are then organized into paragraphs. The process is parameterized, with relevant parameters detailed in the Grouper Settings. Sample code:

    Bitmap image = ... // Your bitmap image here
    
    // Initialize executor 
    Executor executor = Executors.newFixedThreadPool(1); 
    
    // Input parameters include a bitmap image and an executor thread object for performing detection 
    CompletableFuture<TextParagraph[]> futureTextParagraph = textocr.detectParagraphs(image, executor); 
    
    futureTextParagraph.thenAccept(paragraphs -> { 
        // Process the returned array of detected paragraphs 
    }).exceptionally(e -> { 
        if (e instanceof AIVisionSDKException) { 
            Log.e(TAG, "[AIVisionSDKException] Error in text detection: " + e.getMessage()); 
        } 
        return null; 
    }); 
    
    // Once finished with the textocr object, dispose of it to release resources and memory used during detection. 
    textocr.dispose();

Best Practices

This section provides recommendations to improve recognition accuracy across a variety of use cases, from special characters and long words to handwritten text and numeric data. Strategic adjustments to input size, tiling, ROI, and other OCR settings can significantly enhance performance while balancing processing time and application requirements.

  • Improving Recognition Accuracy of Special Characters (e.g., '$') - Enable tiling and use higher resolutions to provide the model with more detailed input for processing.

  • Recognizing Isolated Characters in Confined Spaces - Increase the model input size and enable tiling for reliable detection of isolated characters, such as those within square boxes.

  • Handling Long Words and Numbers - Use larger input sizes and enable tiling to ensure complete detection of lengthy text strings (e.g., 20 to 45 characters) and improve recognition accuracy. Although enabling tiling may increase processing time, its benefits are:

    • Enhances detection of numbers, such as in images of analog meters, by helping to align and cover text within the display more accurately.
    • Reduces noise and improves accuracy if the OCR feature outputs junk data, especially in images with cluttered or overlapping text elements.
    • Enhances the model's ability to handle text beyond typical recognition limits. Balancing higher resolutions and larger input sizes against processing time is crucial to meeting application needs without unnecessary delays; increased accuracy often requires longer processing, so finding the right balance is essential.
  • Improving Text Detection on Cylindrical Objects (e.g., a Coca-Cola can) - Use the Region of Interest (ROI) technique to focus on specific areas of uneven surfaces, enhancing accuracy. If values are not accurate when reading from a distance (e.g., 3 feet or more), increase the input size for better precision, ensuring the model captures the finer details necessary for accurate recognition at greater distances. If special characters and letters do not appear consistently, adjust the minimum box size and box threshold to improve the detection of isolated characters and reduce ambiguity.

  • Improving Accuracy for Consecutive Handwritten Characters - Modify the unclip ratio to ensure accurate alignment and representation of character sequences. For incorrect numeric values decoded by OCR (e.g., on tires), review and fine-tune the OCR settings for numeric data accuracy. A consolidated example follows this list.
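
As a starting point, the adjustments above map onto settings roughly as follows; the values are illustrative experiments, not universal recommendations:

    // Illustrative starting points for the scenarios above - tune per workload
    TextOCR.Settings settings = new TextOCR.Settings("text-ocr-recognizer");

    // Long words, numbers, special characters: enable tiling
    settings.tiling.enable = true;

    // Small or distant text: raise the model input size (multiples of 32)
    settings.detectionInferencerOptions.defaultDims.width = 1280;
    settings.detectionInferencerOptions.defaultDims.height = 1280;

    // Isolated characters missed: relax box-level filtering
    settings.boxThreshold = 0.7f;
    settings.minBoxArea = 5;

    // Consecutive handwritten characters: widen boxes before recognition
    settings.unclipRatio = 2.0f;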


Methods

TextOCR (Settings settings)

    TextOCR.TextOCR(Settings settings) throws IOException

Description: Initializes the OCR with the specified settings, allowing subsequent text detection and analysis on image inputs. It checks for the necessary model file and verifies the integrity of the archive. If issues are detected, appropriate exceptions are thrown.

Parameters:

  • settings TextOCR.Settings - An instance of the Settings class containing configuration options for the OCR engine.

Return Value: A new TextOCR instance.

Exceptions:

  • IOException - Thrown if the archive is corrupted.

detect (Bitmap srcImg, Executor executor)

    CompletableFuture<OCRResult[]> detect (Bitmap srcImg, Executor executor) throws InvalidInputException, AIVisionSDKException

Description: Performs Optical Character Recognition (OCR) on the provided Bitmap image, using the specified executor for asynchronous execution.

Parameters:

  • srcImg (Bitmap) - The Bitmap image to perform OCR on.
  • executor - Manages asynchronous task execution.

Return Value: A CompletableFuture that resolves to an array of OCRResult, each containing complex bounding boxes and recognized text.

Exceptions:

  • InvalidInputException - Thrown if the Bitmap is null.
  • AIVisionSDKException - Thrown if there is an error in detection or the image queue is full.

detectWords (Bitmap srcImg, Executor executor)

    CompletableFuture<Word[]> TextOCR.detectWords (Bitmap srcImg, Executor executor) throws InvalidInputException, AIVisionSDKException

Description: Detects individual words in the provided Bitmap image using the specified executor for asynchronous execution.

Parameters:

  • srcImg (Bitmap) - The image to analyze for word detection.
  • executor - Manages asynchronous task execution.

Return Value: A CompletableFuture that resolves to an array of Word objects, each containing complex bounding boxes and possible text decodes.

Exceptions:

  • InvalidInputException - Thrown if the Bitmap is null.
  • AIVisionSDKException - Thrown if there is an error in detection or the image queue is full.

detectParagraphs (Bitmap srcImg, Executor executor)

    CompletableFuture<TextParagraph[]> detectParagraphs(Bitmap srcImg, Executor executor) throws InvalidInputException, AIVisionSDKException

Description: Detects paragraphs in the provided Bitmap image using the specified executor for asynchronous execution.

Parameters:

  • srcImg (Bitmap) - The image to analyze for paragraph detection.
  • executor - Manages asynchronous task execution.

Return Value: A CompletableFuture that resolves to an array of TextParagraph objects, representing detected paragraphs.

Exceptions:

  • InvalidInputException - Thrown if the Bitmap is null.
  • AIVisionSDKException - Thrown if the AI Data Capture SDK is not initialized.

getTextOCR (Settings settings, Executor executor)

    CompletableFuture<TextOCR> getTextOCR(Settings settings, Executor executor) throws InvalidInputException, AIVisionSDKSNPEException, AIVisionSDKException, AIVisionSDKModelException, AIVisionSDKLicenseException

Description: Asynchronously initializes and retrieves a TextOCR instance using the specified settings and executor.

Parameters:

  • settings - An instance of TextOCR.Settings containing configuration options for the OCR engine.
  • executor - Manages asynchronous task execution.

Return Value: A CompletableFuture that resolves to an initialized TextOCR instance.

Exceptions:

  • InvalidInputException - Thrown if the settings are invalid or null.
  • AIVisionSDKSNPEException - Thrown if there is an error in the SNPE library.
  • AIVisionSDKException - Thrown if the AI Vision SDK is not initialized.
  • AIVisionSDKModelException - Thrown if the current SDK version is incompatible with the required version.
  • AIVisionSDKLicenseException - Thrown if there are licensing issues related to the text-ocr-recognizer model.

process (ImageData imageData, Executor executor)

    CompletableFuture<List<ParagraphEntity>> process(ImageData imageData, Executor executor) throws AIVisionSDKException 

Description: Processes an image to detect text paragraphs, organizing the detected text into words, lines, and paragraphs. This method executes asynchronously and returns a CompletableFuture that can be used to retrieve the results once they are available.

Parameters:

  • imageData - The image data to be processed for text detection.
  • executor - The executor on which results are returned.

Return Value: A CompletableFuture that resolves to a list of ParagraphEntity objects.

dispose()

    void dispose()

Description: Releases all internal resources used by the TextOCR object. This function must be called manually to free up resources.
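
For example, in an Android host component (a hypothetical Activity holding textocr as a member field), disposal can be tied to the lifecycle:

    // Sketch: tie disposal to the host lifecycle (hypothetical Activity field `textocr`)
    @Override
    protected void onDestroy() {
        super.onDestroy();
        if (textocr != null) {
            textocr.dispose(); // release model and native resources
            textocr = null;
        }
    }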


TextOCR.Settings

The Settings class is a nested class within the TextOCR class, which leverages Optical Character Recognition (OCR) to detect, recognize, and group text from images. The flexibility of its parameters allows developers to fine-tune performance for diverse use cases, including document scanning, real-time recognition, and automated data entry.


Constructors

Settings(String mavenModelName)

    TextOCR.Settings textOCRSettings = new TextOCR.Settings(mavenModelName) throws InvalidInputException, AIVisionSDKException;

Description: Constructor for the Settings object with model name.

Parameters:

  • mavenModelName - The name of the model specified in the Maven repository.

Exceptions:

  • InvalidInputException - Thrown if the mavenModelName is invalid.
  • AIVisionSDKException - Thrown if an error occurs while reading the specified model or the AI Data Capture SDK is not initialized.

Settings(File modelFile)

    TextOCR.Settings textOCRSettings = new TextOCR.Settings(modelFile) throws InvalidInputException, AIVisionSDKException;

Description: Constructs a new Settings object with the File object passed.

Parameters:

  • modelFile - The file object that contains the Text OCR model.

Exceptions:

  • InvalidInputException - Thrown if the modelFile is invalid.

  • AIVisionSDKException - Thrown if an error occurs while reading the specified model or the AI Data Capture SDK is not initialized.


Text Detection

The Detection phase processes the input image to create complex bounding boxes, or text boxes. Each text box is represented by a list of points forming a rotated rectangle, which may not be perfectly aligned with the screen’s edges. There may be more than four points if the rectangle is clipped at the edges of the screen. Adjusting Detection Parameters allows for improved accuracy, catering to specific use cases like document scanning, real-time text recognition, or automated data entry.

Typical scenarios for adjusting Detection Parameters:

  • Document Scanning: Digitize documents by extracting text for storage and retrieval.
  • Real-Time Text Recognition: Integrate into applications requiring immediate text recognition from images or video streams.
  • Automated Data Entry: Simplify workflows by pulling text from forms, invoices, or other structured documents.

Detection Parameters

To refine detection accuracy, adjust the Detection Parameters.


detectionInferencerOptions

    InferencerOptions TextOCR.Settings.detectionInferencerOptions = new InferencerOptions()

Description: Allows developers to specify a different input shape for the detection stage inferencer.


recognitionInferencerOptions

    InferencerOptions TextOCR.Settings.recognitionInferencerOptions = new InferencerOptions()

Description: Typically remains unchanged as the input size is fixed for the recognition model. If needed, Recognition results can be adjusted using parameters in the Recognition Parameters section. Note: These options should not be changed by the developer.


Detection Process

The detection process operates in two main stages:

  1. Heatmap Threshold (Pixel-Level Filtering) - Filters pixels based on their likelihood of being part of text. A heatmap is generated where each pixel is assigned a score indicating the likelihood of it being part of a text character. The Heatmap Threshold filters out pixels with low scores, retaining only the most probable candidates for further processing.
  2. Box Threshold (Box-Level Filtering) - Groups the filtered pixels into bounding boxes and removes low-confidence detections. After pixel filtering, the system identifies groups of pixels and draws bounding boxes around them. Each box is assigned a confidence score, and the Box Threshold filters out boxes with low confidence, retaining only those likely to contain text.

Once potential text boxes are identified, additional filtering can be applied to refine results. This includes adjusting box size, area, and orientation to eliminate noise or unwanted detections and to optimize detection for accurate text recognition. These refinements are achieved using Filtering Parameters.


heatmapThreshold

    Float TextOCR.Settings.heatmapThreshold

Description: Sets a cutoff to identify potential areas likely to contain text, converting them into text boxes. (Internally, the detector model creates a grayscale image, or heatmap, that represents text confidence.)

Tuning effect:

  • Increase Threshold - Reduces areas identified as text and reduces noise. Useful for high-contrast clear text such as scanned documents.
  • Decrease Threshold - Expands areas identified as text. Useful for faint, curved, or blurred text with low contrast.

Default: 0.5f

Valid range: [0.0f, 1.0f]


boxThreshold

    Float TextOCR.Settings.boxThreshold

Description: Sets the minimum confidence score required for a text box to be included in the OCR output. Boxes with confidence scores below this threshold are excluded, helping to filter out less certain text detections.

Tuning effect:

  • Increase Threshold: Excludes less-confident text boxes (reduces false positives), useful when too many boxes are detected.
  • Decrease Threshold: Includes more text boxes (catches weak detections), which might be necessary when important text is being missed.

Default: 0.85f

Valid range: [0.0f, 1.0f]


Filtering Parameters

minBoxArea

    Integer TextOCR.Settings.minBoxArea

Description: Filters out text boxes if their total area (width × height) is too small, filtering "tiny" boxes. This helps remove unimportant boxes from the OCR output.

Tuning effect:

  • Increase Parameter: Filters out boxes with small areas, eliminating dust, dots, or tiny artifacts.
  • Decrease Parameter: Helps to detect smaller text.

Default: 10

Valid range: [0, max(int)]


minBoxSize

    Integer TextOCR.Settings.minBoxSize

Description: Filters out text boxes that are too narrow ("skinny") or too short, which likely do not contain real text.

Tuning effect:

  • Increase Parameter: Filters out very narrow boxes and helps ignore divider lines, underscores, or non-text lines.
  • Decrease Parameter: Helps to detect smaller text.

Default: 1

Valid range: [0, max(int)]


minRatioForRotation

    Float TextOCR.Settings.minRatioForRotation

Description: Rotates vertically oriented boxes (high height, low width) so they become horizontal.
Note: Words are generally wider than they are tall, so their ratio should exceed the default value. Therefore, avoid changing this parameter for words, since word complex bounding boxes should be horizontally oriented before recognition.

Tuning effect: Setting this parameter to 0 disables rotation. Otherwise, boxes with a height-to-width ratio exceeding this value are rotated 90 degrees counterclockwise before recognition.

Default: 1.5f

Valid range: [0.0f, inf] (where ‘inf’ denotes infinity)


unclipRatio

    float TextOCR.Settings.unclipRatio

Description: Expands or "stretches" detected boxes outward to include full characters and some background. Expands box size before recognition to improve results. Tight-fitting boxes might benefit from some extra background for better decoding.

Tuning effect: Increasing this parameter enlarges text boxes, potentially improving recognition. An unclipRatio of 1 keeps boxes unchanged, while 1.5 enlarges them by 50%.

  • Increase Parameter: For curved, rotated, or incomplete detections
  • Decrease Parameter: To avoid overlapping with neighboring text regions or noisy regions

Default: 1.5f

Valid range: [1.0f, inf]


Sample Code

This sample code demonstrates how to adjust detection parameter settings:

  1. Configure Settings: Initialize a TextOCR.Settings object and customize parameters such as heatmapThreshold and boxThreshold to improve detection accuracy based on your specific needs.

  2. Asynchronous Initialization: Use an Executor to initialize the TextOCR instance asynchronously, allowing for efficient resource management and responsiveness.

  3. Load Bitmap Image: Prepare the image for OCR by converting it to a Bitmap object.

  4. Perform OCR: Use the detect method to analyze the image and retrieve an array of OCRResult objects with complex bounding boxes and recognized text.

  5. Process OCR Results: Handle the results by iterating over the OCRResult array, outputting the recognized text or using it for further processing.

  6. Dispose Resources: After completing OCR operations, call dispose() to release resources and prevent memory leaks.

    import com.zebra.ai.vision.detector.TextOCR;
    import android.graphics.Bitmap;
    
    // Initialize settings with a custom heatmap threshold 
    String mavenModelName = "text-ocr-recognizer"; 
    TextOCR.Settings settings = new TextOCR.Settings(mavenModelName); 
    settings.heatmapThreshold = 0.3f; // Lower threshold for low-contrast text 
    settings.boxThreshold = 0.9f; // Higher threshold for more confident text boxes 
    settings.minBoxSize = 10; // Set minimum box size to 10 pixels 
    settings.minBoxArea = 50; // Set minimum box area to 50 pixels 
    settings.unclipRatio = 2.0f; // Enlarge text boxes by 100% 
    settings.minRatioForRotation = 2.0f; // Rotate boxes with height-to-width ratio exceeding 2.0 
    
    // Optional: set the runtime processing order; by default DSP is used 
    Integer[] rpo = new Integer[]{InferencerOptions.DSP}; 
    settings.detectionInferencerOptions.runtimeProcessorOrder = rpo; 
    settings.recognitionInferencerOptions.runtimeProcessorOrder = rpo; 
    
    // Initialize executor 
    Executor executor = Executors.newFixedThreadPool(1); 
    
    CompletableFuture<TextOCR> futureObject = TextOCR.getTextOCR(settings, executor); 
    
    // Use the futureObject to implement the thenAccept() callback of CompletableFuture 
    futureObject.thenAccept(OCRInstance -> { 
        // Use the TextOCR object returned here for text detection and recognition 
        textocr = OCRInstance; 
    }); 
    
    // Load your Bitmap image 
    Bitmap image = ...; // Your input image 
    
    // Perform OCR 
    CompletableFuture<OCRResult[]> futureResult = textocr.detect(image, executor); 
    futureResult.thenAccept(ocrResults -> { 
        // Process the returned output that contains complex bounding boxes and text in it 
    }); 
    
    // Dispose resources 
    // Once use of the textocr object is done, dispose of it to release the resources and memory used for detection 
    textocr.dispose(); 
    

Text Recognition

The Recognition stage analyzes the text within each complex bounding box, or text box, produced during the Detection Stage to identify the text content. Each text box results in a list of potential text decodes.

After text boxes are detected, the next step is to extract and accurately read the text within each bounding box. AI Suite uses the "Total" decoder to convert character predictions into meaningful words, even in cases where the model is uncertain about specific characters.

The "Total" decoder employs a systematic filtering process to refine character predictions, focusing on balancing accuracy and efficiency while assembling words. Adjusting the Decoder Parameters TopK Ignore Cutoff, Total Prob Threshold, and Max Word Combinations, act as filters to refine predictions and determine the final output.

Step-by-Step Process:

  1. Generate a Ranked List of Predictions for Each Character: For every character slot (e.g., a space in a word), the system creates a list of possible characters, ranked by confidence scores.
    • To explain how the parameters work together, consider the following example predictions for a single character slot: 'S' at 40%, 's' at 30%, '5' at 15%, 'B' at 5%, '8' at 2%.
  2. Apply Two Filters to Refine Predictions:
    • First Filter - TopK Ignore Cutoff (The Gatekeeper): Limits how many of the highest-confidence character predictions are considered for each character slot.
      • Example: If the cutoff is 4, only the top 4 predictions ('S', 's', '5', 'B') are kept. Predictions below the cutoff (like '8') are discarded.
    • Second Filter - Total Prob Threshold (The Quality Check): Ensures the cumulative confidence of the retained predictions meets a defined minimum threshold (e.g., 90%).
      • Example: Using the top predictions ('S', 's', '5', 'B'), their combined confidence is: 0.40 + 0.30 + 0.15 + 0.05 = 0.90.
        • If the combined score falls below the threshold (e.g., a combined score of 85% against a 90% threshold), the system gives up on this character slot and outputs a placeholder like "�".
        • If the score meets or exceeds the threshold, the decoder narrows down predictions further (e.g., keeping only 'S' and 's' if a stricter threshold like 0.50 is used).
  3. Word Assembly: Once character predictions pass the filters, they are assembled into valid words. The Max Word Combinations parameter limits the number of full-word combinations generated from the remaining character predictions after filtering.
    • Example: After filtering, 20 valid word combinations remain. If Max Word Combinations is set to 5, only the top 5 most confident word results are returned. The remaining 15 combinations are ignored, even if they are valid.
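
The sketch below walks through the example numbers in plain Java. It only illustrates the filtering logic described above; it is not the SDK's internal decoder implementation:

    // Illustrative only - mirrors the example above, not the SDK's internal decoder
    char[] candidates = {'S', 's', '5', 'B', '8'};
    double[] scores = {0.40, 0.30, 0.15, 0.05, 0.02}; // ranked by confidence

    int topkIgnoreCutoff = 4;          // first filter: keep at most 4 candidates
    double totalProbThreshold = 0.90;  // second filter: required cumulative confidence

    double cumulative = 0.0;
    for (int i = 0; i < Math.min(topkIgnoreCutoff, scores.length); i++) {
        cumulative += scores[i];       // 0.40 + 0.30 + 0.15 + 0.05 = 0.90
    }

    if (cumulative >= totalProbThreshold) {
        // Slot passes: retained candidates move on to word assembly
        System.out.println("Slot decoded, best candidate: " + candidates[0]);
    } else {
        // Slot fails: the decoder emits a placeholder character instead
        System.out.println("Low-confidence slot -> \uFFFD");
    }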

Recognition Parameters

This section provides the Recognition Parameters to help refine the recognition process.


decodingTopkIgnoreCutoff

    Integer TextOCR.Settings.decodingTopkIgnoreCutoff

Description: The maximum number of highest-confidence character predictions the "Total" decoder considers for each character position, impacting the accuracy and completeness of text recognition. If additional characters beyond this cutoff would be needed to meet the Total Prob Threshold, the model outputs a replacement character (e.g., "�"). This parameter is applicable for the following scenarios:

  • Complex Text Recognition - Increase this parameter for documents with complex or ambiguous text where capturing all character variations is crucial.
  • Improving Character Accuracy - Use this setting in scenarios where critical text components are consistently missing, ensuring thorough character analysis.
  • Adaptive Text Processing - Adjust dynamically based on the complexity and quality of input text to optimize OCR performance.

Tuning effect: Generally, keep this at the default value. If the expected character does not appear in the OCR output, increasing this value allows more, lower-confidence decodes to be considered.

Default: 4

Valid range: [1, max(int)]


decodingTotalProbThreshold

    Float TextOCR.Settings.decodingTotalProbThreshold

Description: Sets the minimum cumulative confidence score that character decodes must achieve to be accepted. This setting is crucial in the total decoding strategy of the OCR recognition process, as it balances accuracy and coverage in text recognition. If the threshold is not reached, no high-confidence decode exists, resulting in a placeholder character (�) appearing in the output.

Relevant scenarios:

  • Improving Decode Coverage - Lower the threshold when critical text characters are missing, to capture a wider range of more potential decodes.
  • Analyzing Complex Documents - Apply this setting for documents with ambiguous or low-quality text to ensure more comprehensive character recognition.
  • Adaptive Recognition - Adjust dynamically based on the quality and complexity of input documents to optimize OCR performance for specific needs.

Tuning effect: If many characters are not decoded, evidenced by multiple � characters, decreasing this value may improve results. Decrease this parameter for more flexible but potentially noisier results; increase it for more trustworthy, reliable outputs.

Default: 0.9f

Valid range: [0.0f, 1.0f]


decodingMaxWordCombinations

    Integer TextOCR.Settings.decodingMaxWordCombinations

Description: Restricts the number of valid word outputs generated from possible character combinations for each detection. This helps avoid overwhelming results, particularly for ambiguous inputs, by limiting the model’s consideration of all potential character combinations across all positions in the word. It is applicable for the following scenarios:

  • Detailed Text Analysis - Increase this parameter for applications that require a thorough analysis of text.
  • Data Extraction - Adjust this parameter to optimize the extraction of comprehensive data from documents with complex or ambiguous text.

Tuning effect: Increasing this number returns more decodes, but potentially with lower confidence. Decreasing this parameter results in faster processing and fewer alternatives.

Default: 10

Valid range: [1, max(int)]


Recognition: Special Cases

These features are intended only for special scenarios and are usually not needed for most OCR tasks.


flip

    boolean TextOCR.Settings.flip

Description: Runs recognition in multiple orientations to boost accuracy on rotated or flipped text. If set to true, performs recognition twice - once in the regular orientation and once rotated by 180 degrees. Enable only if text orientation varies, as it increases processing time.
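
Enabling it is a one-line settings change on the TextOCR.Settings object, for example:

    // Given a TextOCR.Settings object named `settings`:
    // recognize both regular and 180-degree-rotated orientations (slower)
    settings.flip = true;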

Tiling

Tiling helps OCR handle very long, thin lines of text (like serial numbers, document titles, or part numbers) by splitting them into smaller, manageable pieces ("tiles") for better recognition. This is useful when a word box exceeds the recognition limit (15 characters). Tiling adds processing time and should only be used as needed.


Sample Code

Sample code demonstrating use of recognition parameters:

  1. Initialize Settings: Configure the OCR settings, including additional parameters such as heatmapThreshold and tiling.

  2. Create TextOCR Instance: Use an executor to initialize the TextOCR instance asynchronously with the configured settings.

  3. Load Bitmap Image: Prepare the bitmap image that you want to analyze using OCR.

  4. Perform OCR: Invoke the detect method on the TextOCR instance to analyze the bitmap image, managing the asynchronous processing with the executor.

  5. Process OCR Results: Handle the results, which include complex bounding boxes and recognized text.

  6. Dispose Resources: After completing OCR operations, call the dispose method on the TextOCR instance to release resources and prevent memory leaks.

    import com.zebra.ai.vision.detector.TextOCR;
    import android.graphics.Bitmap;
    
    // Initialize settings 
    String mavenModelName = "text-ocr-recognizer"; 
    TextOCR.Settings textOCRSettings = new TextOCR.Settings(mavenModelName); 
    textOCRSettings.heatmapThreshold = 0.5f; 
    textOCRSettings.decodingTotalProbThreshold = 0.9f; 
    textOCRSettings.tiling.enable = true; 
    
    // Optional: set the runtime processing order; by default DSP is used 
    Integer[] rpo = new Integer[]{InferencerOptions.DSP}; 
    textOCRSettings.detectionInferencerOptions.runtimeProcessorOrder = rpo; 
    textOCRSettings.recognitionInferencerOptions.runtimeProcessorOrder = rpo; 
    
    // Instantiate TextOCR with the configured settings 
    // textOCRSettings = TextOCR.Settings object created above 
    // executor = An executor thread for processing API calls and returning results 
    
    // Initialize executor 
    Executor executor = Executors.newFixedThreadPool(1); 
    
    CompletableFuture<TextOCR> futureObject = TextOCR.getTextOCR(textOCRSettings, executor); 
    
    // Use the futureObject to implement the thenAccept() callback of CompletableFuture 
    futureObject.thenAccept(OCRInstance -> { 
        // Use the TextOCR object returned here for text detection and recognition 
        textocr = OCRInstance; 
    }); 
    
    // Load your Bitmap image 
    Bitmap image = ...; 
    
    // Perform OCR 
    CompletableFuture<OCRResult[]> futureResult = textocr.detect(image, executor); 
    futureResult.thenAccept(ocrResults -> { 
        // Process the returned output that contains complex bounding boxes and recognized text 
    }); 
    
    // Dispose resources 
    // Once done using the textocr object, dispose of it to release resources and memory used for detection. 
    textocr.dispose();
    

Tiling

The OCR Recognition stage limits Word boxes to 15 characters. To achieve good results with “Words” containing more than 15 characters, such as ID numbers or VINs, enable Tiling. Tiling splits text boxes generated at the localization stage into overlapping crops, performs recognition on each, and uses a correlation-based merging algorithm to prepare a unified decode. Tiling increases processing time, so use it only when needed. Not all “Words” will be tiled; only those meeting threshold criteria specified by the developer will be tiled.


Tiling Settings

The TilerSettings class is a configuration component within the TextOCR.Settings framework of the Zebra AI Data Capture SDK. It provides parameters to fine-tune the behavior of the tiling feature, which is used during the text detection and recognition process. These settings primarily control how boxes are merged and processed based on their aspect ratios and correlation thresholds.

Caution: Tiling adds processing time and should only be used as needed.

Configure TilerSettings in the following scenarios:

  • Large Document Processing: Enable tiling to process large documents efficiently, especially when sections require individual handling due to size limitations.
  • Complex Layout Handling: Adjust tiling settings for documents with complex layouts to improve the accuracy of text recognition.
  • Performance Tuning: Fine-tune parameters to achieve an optimal balance between processing speed and accuracy, based on specific application requirements.

enable

    Boolean TextOCR.Settings.TilerSettings.enable

Description: Enables or disables the tiling feature. When true, TextOCR performs tiling operations on detected text regions, splitting boxes that meet aspect ratio criteria into multiple tiles, recognizing text, and merging results using a correlation method.

Default: false


aspectRatioLowerThr

    Float TextOCR.Settings.TilerSettings.aspectRatioLowerThr

Description: Defines the lower aspect-ratio limit for tiling: only boxes whose width-to-height ratio exceeds this value are tiled, since they likely contain long text strings. This controls which boxes are considered "elongated". Tune this parameter together with aspectRatioUpperThr.

Tuning effect: Decreasing this threshold results in more rectangular-shaped (low-aspect ratio) boxes being tiled. If the desired text box is not tiled, decreasing this parameter may help. Increase this threshold to tile only very long boxes.

Default: 10.0f

Valid range: [1.0f, inf]

Return Value: Float value representing the lower threshold for aspect ratios.


aspectRatioUpperThr

    float TextOCR.Settings.TilerSettings.aspectRatioUpperThr

Description: Defines the upper aspect-ratio limit: only boxes up to this width-to-height ratio are tiled, preventing extremely long, odd-shaped boxes from being tiled. This filters out boxes with very high aspect ratios, which occur rarely and may be false positives from the text detection model. Tune this parameter together with aspectRatioLowerThr. A similar effect can be achieved with the minBoxSize parameter.

Tuning effect: Increasing this parameter allows tiling of more long and narrow boxes. Decreasing this parameter avoids tiling extremely stretched or odd-shaped boxes.

Default: 40.0f

Valid range: [1.0f, inf]

Return Value: Float value representing the upper threshold for aspect ratios.


topkMergedPredictions

    Integer TextOCR.Settings.TilerSettings.topkMergedPredictions

Description: Limits the number of decodes returned based on confidence scores. This affects how many merged combinations are returned during the tiling stage.

Tuning effect: Increasing this parameter increases the number of possible results to review. Decreasing this parameter results in fewer, faster results.

Default: 5

Valid range: [1, max(int)]

Return Value: Integer representing the top merged predictions to return.


Advanced Tiling Parameters

These advanced tiling parameters are intended only for edge cases that are difficult to solve. Zebra recommends not changing these parameters unless necessary.


topCorrelationThr

    Float TextOCR.Settings.TilerSettings.topCorrelationThr

Description: Sets the threshold for correlation to consider merging boxes. Increasing this value decreases the number of merge points considered.

Tuning effect: Increasing this value restricts the internal merging mechanism to use only points with a correlation score higher than this value. Setting it to 0 removes the limit. If incorrect tiling occurs, increasing this parameter may help.

Default: 0.0f

Valid range: [0.0f, 1.0f]

Return Value: Float value representing the correlation threshold value.


mergePointsCutoff

    Integer TextOCR.Settings.TilerSettings.mergePointsCutoff

Description: Determines the cutoff for the number of merge points. If the number exceeds this value, merging is not performed. This internal parameter limits the number of possible combinations used for tile merging.

Tuning effect: Increasing this value results in more combinations being used, increasing processing time but potentially generating more accurate results.

Default: 5

Valid range: [1, max(int)]

Return Value: Integer representing the maximum number of merge points allowed.


splitMarginFactor

    Float TextOCR.Settings.TilerSettings.splitMarginFactor

Description: Reduces the probability of spurious characters appearing at the edges of tiles as a result of splitting.

Default: 0.1f

Valid range: [0.0f, 1.0f]

Return Value: Float value representing the factor applied to margin splitting.


Sample Code

The TilerSettings object is part of the TextOCR.Settings configuration. Access and modify TilerSettings through the TextOCR.Settings object.

This sample code demonstrates how to configure TilerSettings and process the image for text detection and recognition:

  1. Initialize Settings: Begin by creating a TextOCR.Settings instance.

  2. Configure TilerSettings: Access the TilerSettings within the TextOCR.Settings instance and set custom values for tiling parameters to control how the image is divided and processed.

  3. Instantiate TextOCR: Use the configured settings to create a TextOCR instance. This object will handle the text detection and recognition processes.

  4. Load Bitmap Image: Prepare the image for OCR by converting it to a Bitmap object.

  5. Perform Detection: Use the detect method to analyze the image and retrieve an array of OCRResult objects containing the detected text.

  6. Print Results: Iterate over the OCRResult array to output the recognized text to the console.

  7. Dispose Resources: Free up system resources by calling the dispose method on the TextOCR object after usage.

    import com.zebra.ai.vision.detector.TextOCR;
    import android.graphics.Bitmap;
    
    // Initialize settings with custom tiling options (TilerSettings is accessed
    // through the TextOCR.Settings object, as described above)
    String mavenModelName = "text-ocr-recognizer";
    TextOCR.Settings settings = new TextOCR.Settings(mavenModelName);
    settings.tiling.enable = true;
    settings.tiling.aspectRatioLowerThr = 8.0f;
    settings.tiling.aspectRatioUpperThr = 35.0f;
    settings.tiling.mergePointsCutoff = 10;
    
    // Instantiate TextOCR with the configured settings (assigned to textocr in the callback)
    Executor executor = Executors.newFixedThreadPool(1);
    CompletableFuture<TextOCR> futureObject = TextOCR.getTextOCR(settings, executor);
    futureObject.thenAccept(OCRInstance -> {
        textocr = OCRInstance;
    });
    
    // Load your Bitmap image
    Bitmap bitmap = ...; // Your input image
    
    // Input params: bitmap image (to perform detection) and an executor thread object (in which the detection happens and the results are returned)
    CompletableFuture<OCRResult[]> futureResult = textocr.detect(bitmap, executor);
    
    futureResult.thenAccept(ocrResults -> {
        // Process the returned output that contains complex bounding boxes and text in it.
    });
    
    // Dispose resources
    // Once done using the textocr object, dispose of it to release resources and memory used for detection.
    textocr.dispose();
    

Text Grouping

After Words are identified and decoded in the Text Recognition stage, the Text Grouping stage organizes them into lines or paragraphs. This process is carried out in two steps:

  1. Words detected by OCR are grouped into Lines.
  2. The Lines are further grouped into Paragraphs.

In the graphic representation below, Words, Lines and Paragraphs are represented by blue, green and fuchsia borders, respectively.

[Image: Words, Lines, and Paragraphs outlined in blue, green, and fuchsia]


Grouper Settings

The GrouperSettings class provides parameters for customizing the behavior of the OCR text grouping algorithm. It offers control over how text elements are spatially organized based on their geometric properties. By adjusting these settings, developers can fine-tune how text boxes are grouped into lines, paragraphs, or other structures based on their spatial relationships.


widthDistanceRatio

    Float TextOCR.Settings.GrouperSettings.widthDistanceRatio

Description: Determines the threshold for joining Words into Lines. Adjusting this parameter allows control over acceptable spacing between Words in a Line. Words spaced beyond this threshold are treated as separate Lines. The default value of 1.5f indicates that the acceptable space between Words should not exceed 50% of their average width. Increasing this value to 2.0f allows for a maximum acceptable space of 100% of the average Word width.

For example, if the average Word width is 90 pixels, widthDistanceRatio of 2.0 allows words with centers up to 180 pixels apart to be grouped into the same Line.

Tuning effect: Increasing this parameter causes horizontally spaced Words to join into a Line. Set this value higher if Words are spaced further apart and should be joined into a Line, such as in artistic layouts.

Default: 1.5f

Valid range: [0.0f, inf]



heightDistanceRatio

    Float TextOCR.Settings.GrouperSettings.heightDistanceRatio

Description: Affects the grouping of Words into Lines, particularly in scenarios where text undergoes a sudden change in font size but should still be grouped together. Although the algorithm has no knowledge of the actual font size, it uses the height of the complex bounding box to approximate it. The default value of 2.0f indicates that Words will be grouped together even if their font size differs by up to twice the height.

For example, setting this parameter to 4.0 allows words with height differences up to 4 times to be grouped into the same Line.

Tuning effect: Increasing this parameter allows words of varying heights to join into a single Line. Raise this value higher when there is significant variation in text sizes within the same line, such as in documents with mixed fonts. Decrease this parameter if strange font-size jumps are creating messy lines.

Default: 2.0f

Valid range: [1.0f, inf]



centerDistanceRatio

    Float TextOCR.Settings.GrouperSettings.centerDistanceRatio

Description: Affects the joining of Words into Lines, particularly in scenarios where lines of text are not perfectly straight, such as in curved lines of text. The threshold value should be adjusted empirically, as it mathematically represents the relationship between the positions of two consecutive Words.

For example, if the average Word height is 20 pixels, setting centerDistanceRatio to 1.0 allows Words with centers up to 20 pixels apart vertically to be grouped into the same Line.
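A sketch of that adjustment, assuming the same textOCRSettings object as in the Sample Code section:

    // Permit vertical center offsets of up to one full Word height (useful for curved text)
    textOCRSettings.grouping.centerDistanceRatio = 1.0f;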

Tuning effect: Increasing this parameter allows Words that are not vertically aligned to be joined into the same Line. Decrease this value if only straight lines should be grouped.

Default: 0.6f

image


paragraphHeightDistance

    Float TextOCR.Settings.GrouperSettings.paragraphHeightDistance

Description: Sets the maximum vertical distance between the centers of two Lines for them to be grouped into a Paragraph, expressed as a multiple of the average Line height. It is particularly useful when Lines of text have unusually large "leading", the typographic term for the distance between consecutive Lines in a Paragraph. The default value of 1.0f indicates that Lines can be grouped into a Paragraph if their centers are spaced apart by up to 100% of their average height.

For example, if the average Line height is 30 pixels, setting this parameter to 2.0 allows Lines with centers up to 60 pixels apart to be grouped into a Paragraph.
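In code, a sketch assuming the textOCRSettings object from the Sample Code section:

    // Group Lines whose centers are up to twice the average Line height apart (large leading)
    textOCRSettings.grouping.paragraphHeightDistance = 2.0f;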

Tuning effect: Increasing this parameter allows Lines that are spaced farther apart vertically to be joined into a Paragraph. Raise this value for documents with widely spaced Lines; decrease it if too many Lines are being grouped together.

Default: 1.0f

Valid range: [0.0f, inf]

image


paragraphHeightRatioThreshold

    Float TextOCR.Settings.GrouperSettings.paragraphHeightRatioThreshold

Description: Sets the minimum ratio between the heights of two adjacent Lines (the smaller height divided by the larger) for them to be joined into a Paragraph. This can be useful when Lines of varying font sizes should be joined into a single Paragraph. Although the algorithm has no knowledge of actual font sizes, it uses the height of the complex bounding box as an approximation. The default value of 1.0/3.0f (approximately 0.33) indicates that consecutive Lines will still be grouped into a single Paragraph if their heights differ by a factor of up to 3.

For example, if the average Line height is 50 pixels, setting this parameter to 0.2 allows Lines with heights ranging from approximately 10 pixels to 250 pixels to be grouped into the same Paragraph.
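As a sketch, assuming the textOCRSettings object from the Sample Code section:

    // Accept up to a 5x height difference (ratio 0.2) between adjacent Lines in a Paragraph
    textOCRSettings.grouping.paragraphHeightRatioThreshold = 0.2f;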

Tuning effect: Decreasing this parameter allows Lines with larger height differences to be joined into a Paragraph, which can be useful for documents with diverse fonts. Increase this parameter to only group similar-sized lines.

Default: 0.33f

Valid range: [0.0f, 1.0f]

image


Sample Code

To utilize the OCR capabilities of the TextOCR library, follow these steps to configure settings, prepare your image, and perform text detection:

  1. Configure Settings: Initialize a TextOCR.Settings object and customize the GrouperSettings parameters for text grouping.

  2. Asynchronous Initialization: Use an Executor to initialize the TextOCR instance asynchronously, allowing for efficient resource management and responsiveness.

  3. Load Bitmap Image: Prepare the image for OCR by converting it to a Bitmap object.

  4. Perform OCR: Use the detect method to analyze the image, retrieving an array of OCRResult objects with complex bounding boxes and recognized text.

  5. Process OCR Results: Handle the results by iterating over the OCRResult array, outputting the recognized text or using it for further processing.

  6. Dispose Resources: After completing OCR operations, call dispose() to release resources and prevent memory leaks.

    import com.zebra.ai.vision.TextOCR;

    import android.graphics.Bitmap;
    import android.graphics.BitmapFactory;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    // Initialize TextOCR settings
    String mavenModelName = "text-ocr-recognizer";
    TextOCR.Settings textOCRSettings = new TextOCR.Settings(mavenModelName);

    // Access the GrouperSettings and set custom values for grouping parameters
    textOCRSettings.grouping.widthDistanceRatio = 1.5f;
    textOCRSettings.grouping.heightDistanceRatio = 2.0f;
    textOCRSettings.grouping.centerDistanceRatio = 0.6f;
    textOCRSettings.grouping.paragraphHeightDistance = 1.0f;
    textOCRSettings.grouping.paragraphHeightRatioThreshold = 0.33f;

    // Initialize executor
    Executor executor = Executors.newFixedThreadPool(1);

    // Initialize the TextOCR instance asynchronously with the configured settings.
    // Note: the getTextOCR factory shown here is assumed from the SDK's asynchronous
    // initialization pattern; confirm the exact call against the API reference.
    // join() blocks for brevity in this snippet; prefer thenAccept() in production code.
    TextOCR textOCR = TextOCR.getTextOCR(textOCRSettings, executor).join();

    // Load the bitmap image to perform detection on (the file path is illustrative)
    Bitmap bitmap = BitmapFactory.decodeFile("/sdcard/Download/sample.png");

    // Input params include the bitmap image (to perform detection on) and an executor thread object (in which the detection happens)
    CompletableFuture<OCRResult[]> futureResult = textOCR.detect(bitmap, executor);

    futureResult.thenAccept(ocrResults -> {
        // Process the returned output that contains complex bounding boxes and text within it
    });

    // Dispose resources
    // Once done using the textOCR object, dispose it to release resources and memory used for detection.
    textOCR.dispose();
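
Once the future completes, the returned OCRResult array can be iterated for downstream processing. A minimal usage sketch (the exact accessor names on OCRResult are not shown here; consult the OCRResult API reference):

    futureResult.thenAccept(ocrResults -> {
        for (OCRResult result : ocrResults) {
            // Read the recognized text and its complex bounding box from each
            // result here; see the OCRResult reference for the exact accessors.
        }
    });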
    

Troubleshooting Guide

Note:

  • ↑ indicates an increase in the value or parameter
  • ↓ indicates a decrease in the value or parameter

Quick Tips for Detection

If the following issues are encountered, try these adjustments:

| Issue | Suggested Adjustment |
| --- | --- |
| Missing faint/small text | ↓ Heatmap and Box Threshold or ↓ Min Box Size/Area |
| Too much junk/noise | ↑ Heatmap and Box Threshold or ↑ Min Box Size/Area |
| Boxes are too tight, cutting off letters | ↑ Unclip Ratio |
| Boxes overlap too much | ↓ Unclip Ratio |
| Weird rotations on lines or rules | ↑ Min Ratio for Rotation |
| Elongated or tall font styles | ↑ Min Ratio for Rotation |
| Need to detect tilted/angled/curved text | ↓ Min Ratio for Rotation and ↑ Unclip Ratio |

Quick Tips for Recognition

If the following issues are encountered, try these adjustments:

| Issue | Suggested Adjustment |
| --- | --- |
| Too many mistakes or incorrect guesses | ↑ Total Probability Threshold |
| Missing letters or "�" characters in output | ↓ Total Probability Threshold and ↑ TopK Ignore Cutoff |
| Unclear or handwritten text | ↑ TopK Ignore Cutoff and ↓ Total Probability Threshold |
| Too many uncertain or incorrect decodes | ↑ Total Probability Threshold |
| Missing faint or ambiguous characters | ↓ Total Probability Threshold and ↑ Max Word Combinations |
| Need alternatives for post-processing | ↑ Max Word Combinations |

Quick Tips for Grouping

If the following issues are encountered, try these adjustments:

| Issue | Suggested Adjustment |
| --- | --- |
| Words that should be on one line are not | ↑ Width Distance Ratio |
| Lines with different font sizes are not grouping | ↑ Height Distance Ratio or ↓ Paragraph Height Ratio Threshold |
| Curved or wavy text splits into separate groups | ↑ Center Distance Ratio |
| Lines in a paragraph are not grouping | ↑ Paragraph Height Distance |
| Lines of different heights are not grouping | ↓ Paragraph Height Ratio Threshold |
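
Applied through GrouperSettings, these adjustments are single-line changes. For instance, a sketch for the mixed-font case (field names as in the Sample Code section; the values are illustrative starting points):

    // Lines with different font sizes are not grouping: tolerate larger height
    // differences within Lines and between adjacent Lines
    textOCRSettings.grouping.heightDistanceRatio = 3.0f;
    textOCRSettings.grouping.paragraphHeightRatioThreshold = 0.25f;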

Quick Tips for Tiling (Special Cases)

If the following issues are encountered, try these adjustments:

| Issue | Suggested Adjustment |
| --- | --- |
| Long, narrow text is not read correctly | Enable tiling |
| Errors appear at the edges of tiles | Adjust Split Margin Factor (usually leave at default) |
| Tiling merges boxes that should not be merged | ↑ Top Correlation Threshold |
| Results are slow and perfect accuracy is not needed | ↓ Merge Points Cutoff and ↓ TopK Merged Predictions |
| Results on very long text lines are inaccurate | ↑ Merge Points Cutoff and ↑ Aspect Ratio Upper Threshold |

Sample Apps

Refer to the following resources:

  • Start building your first product and shelf recognizer application with the QuickStart Sample application source.
  • Consult the Java/Kotlin snippets, which demonstrate the SDK's capabilities and can be easily integrated into your applications.
  • Access advanced use case and technology-based demos through the Showcase Application, including the AI DataCapture demo, which shows how users can enroll and recognize products in real time.