Streams API Basics for Data Processing

The Streams API turns "loop, filter, transform, collect" into a fluent pipeline. If you've used JavaScript's array.filter(...).map(...).reduce(...) or Python's list comprehensions, the shape will feel familiar. A Java stream takes a collection, threads each element through a chain of operations, and lands in a result: a list, a count, a sum, a boolean, an Optional. Streams are how modern Java codebases process test data: counting failures, grouping by priority, calculating pass rates, finding the slowest test. This lesson covers the operations you'll use most and the mental model that makes them easy to reason about.

Creating a stream

Most often you stream a collection:

List<String> tests = List.of("LoginTest", "Search", "CheckoutTest");
tests.stream();                     // Stream<String>

Or an array:

String[] arr = {"a", "b", "c"};
Arrays.stream(arr);                 // Stream<String>

Or a literal sequence:

Stream.of("a", "b", "c");
IntStream.range(0, 10);             // 0, 1, ..., 9

A Stream<T> is a lazy view over a sequence of T. It doesn't store data; it pulls from the source as you ask for results.
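The fragments above stop before any result is produced. As a minimal sketch, here is each source with a terminal operation attached so you can run it; the class and method names (StreamSources, fromList, and so on) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class StreamSources {

    // Source 1: a collection
    static List<String> fromList() {
        List<String> tests = List.of("LoginTest", "Search", "CheckoutTest");
        return tests.stream()
            .filter(t -> t.endsWith("Test"))
            .toList();
    }

    // Source 2: an array
    static long fromArray() {
        String[] arr = {"a", "b", "c"};
        return Arrays.stream(arr).count();
    }

    // Source 3: literal values
    static String fromValues() {
        return Stream.of("a", "b", "c").collect(Collectors.joining(","));
    }

    // Source 4: a primitive range, end-exclusive (0..9)
    static int fromRange() {
        return IntStream.range(0, 10).sum();
    }

    public static void main(String[] args) {
        System.out.println(fromList());    // [LoginTest, CheckoutTest]
        System.out.println(fromArray());   // 3
        System.out.println(fromValues());  // a,b,c
        System.out.println(fromRange());   // 45
    }
}
```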

The everyday operations

Three families:

Intermediate (return a new stream, lazy):

  • filter(Predicate) — keep matching elements
  • map(Function) — transform each element
  • sorted() / sorted(Comparator) — sort
  • distinct() — drop duplicates
  • limit(n) / skip(n) — pagination
  • peek(Consumer) — debug-style "look at each element"

Terminal (run the pipeline, produce a result):

  • collect(Collectors.toList()) or .toList() — into a List
  • forEach(Consumer) — side effect on each
  • count() — long
  • findFirst() / findAny() → Optional<T>
  • anyMatch(Predicate) / allMatch(...) / noneMatch(...) — boolean
  • min(Comparator) / max(...) / reduce(...) → Optional<T>

Numeric specialised (when working with primitives):

  • mapToInt(...), mapToLong(...), mapToDouble(...) — switch to IntStream etc.
  • sum(), average(), min(), max() — only on numeric streams

A pipeline always has zero or more intermediate operations followed by exactly one terminal operation. No terminal call = nothing runs; the stream is lazy and only does work when you ask for results.
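To see "no terminal call = nothing runs" for yourself, here is a small sketch that records when the filter lambda actually executes. The class name LazyPipeline and the log list are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class LazyPipeline {

    // Returns a log of what happened, in the order it happened.
    static List<String> run() {
        List<String> log = new ArrayList<>();

        // Only intermediate operations so far: the filter lambda has not run yet.
        Stream<String> pipeline = Stream.of("Login", "Search", "Checkout")
            .filter(name -> {
                log.add("filter " + name);
                return name.length() > 5;
            });
        log.add("pipeline built");

        // The terminal operation pulls every element through the filter.
        List<String> matches = pipeline.toList();
        log.add("matches = " + matches);
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The log starts with "pipeline built" and only then lists the three filter calls, which shows that building the pipeline did no work on its own.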

A real QA pipeline

import java.util.*;
import java.util.stream.*;
 
public class TestStreams {
 
    record TestResult(String name, String status, String priority, long durationMs) {}
 
    public static void main(String[] args) {
        List<TestResult> results = List.of(
            new TestResult("Login",         "PASSED", "P0", 1450),
            new TestResult("Search",        "PASSED", "P2",  820),
            new TestResult("Checkout",      "FAILED", "P0", 3120),
            new TestResult("Logout",        "PASSED", "P2",  990),
            new TestResult("Export",        "FAILED", "P1", 4800),
            new TestResult("BillingReport", "PASSED", "P1", 2100)
        );
 
        // 1) Count failures
        long failCount = results.stream()
            .filter(r -> r.status().equals("FAILED"))
            .count();
        System.out.println("Failures: " + failCount);
 
        // 2) Failure names, sorted alphabetically
        List<String> failureNames = results.stream()
            .filter(r -> r.status().equals("FAILED"))
            .map(TestResult::name)
            .sorted()
            .toList();
        System.out.println("Failure names: " + failureNames);
 
        // 3) Average duration across all tests
        double avg = results.stream()
            .mapToLong(TestResult::durationMs)
            .average()
            .orElse(0);
        System.out.printf("Average: %.0fms%n", avg);
 
        // 4) Slowest test
        TestResult slowest = results.stream()
            .max(Comparator.comparingLong(TestResult::durationMs))
            .orElseThrow();
        System.out.println("Slowest: " + slowest.name() + " (" + slowest.durationMs() + "ms)");
 
        // 5) Did every P0 test pass?
        boolean allP0Passed = results.stream()
            .filter(r -> r.priority().equals("P0"))
            .allMatch(r -> r.status().equals("PASSED"));
        System.out.println("All P0 passed? " + allP0Passed);
    }
}

Output:

Failures: 2
Failure names: [Checkout, Export]
Average: 2213ms
Slowest: Export (4800ms)
All P0 passed? false

Read each pipeline as a sentence: "of the results, keep failures, project to names, sort, collect into a list." That readability is half the point — the imperative loop equivalent for each pipeline is 6–10 lines and harder to skim.
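For comparison, here is a sketch of pipeline 2 (failure names, sorted) written both ways. The class name LoopVersusStream and the trimmed SAMPLE list are illustrative; the record matches the one used above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LoopVersusStream {

    record TestResult(String name, String status, String priority, long durationMs) {}

    static final List<TestResult> SAMPLE = List.of(
        new TestResult("Login",    "PASSED", "P0", 1450),
        new TestResult("Export",   "FAILED", "P1", 4800),
        new TestResult("Checkout", "FAILED", "P0", 3120)
    );

    // Imperative version: a mutable list, an explicit if, and a separate sort.
    static List<String> failureNamesWithLoop(List<TestResult> results) {
        List<String> names = new ArrayList<>();
        for (TestResult r : results) {
            if (r.status().equals("FAILED")) {
                names.add(r.name());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Stream version: the same four steps, read top to bottom.
    static List<String> failureNamesWithStream(List<TestResult> results) {
        return results.stream()
            .filter(r -> r.status().equals("FAILED"))
            .map(TestResult::name)
            .sorted()
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(failureNamesWithLoop(SAMPLE));    // [Checkout, Export]
        System.out.println(failureNamesWithStream(SAMPLE));  // [Checkout, Export]
    }
}
```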

Method references — the TestResult::name shorthand

map(r -> r.name()) is fine; map(TestResult::name) is shorter and reads more naturally. Method references (lesson 3) shine in stream pipelines because most maps are exactly "call one method on each element." When the lambda body is literally one method call, prefer the reference form.
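A quick side-by-side, as a sketch: the lambda and the reference produce the same list. The two-field record and the class name MethodRefDemo are simplified stand-ins for illustration.

```java
import java.util.List;

class MethodRefDemo {

    // Simplified record for this example only.
    record TestResult(String name, String status) {}

    static final List<TestResult> RESULTS = List.of(
        new TestResult("Login", "PASSED"),
        new TestResult("Checkout", "FAILED")
    );

    // Lambda form: spells out the call on each element.
    static List<String> namesWithLambda() {
        return RESULTS.stream().map(r -> r.name()).toList();
    }

    // Method-reference form: same behaviour, less noise.
    static List<String> namesWithReference() {
        return RESULTS.stream().map(TestResult::name).toList();
    }

    // References chain well when every step is a single method call.
    static List<String> upperCaseNames() {
        return RESULTS.stream()
            .map(TestResult::name)
            .map(String::toUpperCase)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(namesWithLambda());     // [Login, Checkout]
        System.out.println(namesWithReference());  // [Login, Checkout]
        System.out.println(upperCaseNames());      // [LOGIN, CHECKOUT]
    }
}
```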

Optional — find without null

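findFirst(), min(Comparator), max(Comparator), and the one-argument reduce(BinaryOperator) all return Optional<T>, a typed wrapper that's either present or empty. (The reduce overload that takes an identity value returns T directly, and the numeric streams use OptionalInt, OptionalLong, and OptionalDouble instead.) The point is to force you to handle the empty case rather than silently returning null: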

Optional<TestResult> firstFail = results.stream()
    .filter(r -> r.status().equals("FAILED"))
    .findFirst();
 
if (firstFail.isPresent()) {
    System.out.println("first failure: " + firstFail.get().name());
}
 
// or — handle empty inline
String name = firstFail.map(TestResult::name).orElse("(no failures)");
System.out.println(name);

The most common operations on Optional:

  • .isPresent() — boolean check
  • .orElse(default) — value or default
  • .orElseThrow() — value or throw NoSuchElementException
  • .map(...) / .filter(...) — chain operations on the value if present
  • .ifPresent(consumer) — run code only if present

Optional is one of those features that feels weird until it's saved you from a NullPointerException. Then it stays saved.
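Here is a sketch that exercises the operations from the list in one place. The helper names (firstWithStatus, describeIfSlow, throwsWhenEmpty) and the 1000 ms threshold are invented for the example.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

class OptionalDemo {

    record TestResult(String name, String status, String priority, long durationMs) {}

    static final List<TestResult> RESULTS = List.of(
        new TestResult("Login",    "PASSED", "P0", 1450),
        new TestResult("Checkout", "FAILED", "P0", 3120),
        new TestResult("Export",   "FAILED", "P1", 4800)
    );

    static Optional<TestResult> firstWithStatus(String status) {
        return RESULTS.stream()
            .filter(r -> r.status().equals(status))
            .findFirst();
    }

    // filter + map + orElse: every step is skipped when the Optional is empty.
    static String describeIfSlow(Optional<TestResult> result) {
        return result
            .filter(r -> r.durationMs() > 1000)
            .map(r -> r.name() + " took " + r.durationMs() + "ms")
            .orElse("(nothing slow)");
    }

    // orElseThrow: fail loudly when a value is required but missing.
    static boolean throwsWhenEmpty() {
        try {
            firstWithStatus("SKIPPED").orElseThrow();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // ifPresent: run code only when there is a value.
        firstWithStatus("FAILED").ifPresent(r -> System.out.println("first failure: " + r.name()));

        System.out.println(describeIfSlow(firstWithStatus("FAILED")));   // Checkout took 3120ms
        System.out.println(describeIfSlow(firstWithStatus("SKIPPED")));  // (nothing slow)
        System.out.println(throwsWhenEmpty());                           // true
    }
}
```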

Grouping with Collectors.groupingBy

For grouping data by a key, Collectors.groupingBy is the heavyweight tool:

import java.util.*;
import java.util.stream.*;
 
Map<String, List<TestResult>> byPriority = results.stream()
    .collect(Collectors.groupingBy(TestResult::priority));
 
byPriority.forEach((p, list) -> System.out.println(p + " -> " + list.size() + " tests"));

Output (the key order of the returned Map is not guaranteed):

P0 -> 2 tests
P1 -> 2 tests
P2 -> 2 tests

Or count failures by priority in one pass:

Map<String, Long> failuresByPriority = results.stream()
    .filter(r -> r.status().equals("FAILED"))
    .collect(Collectors.groupingBy(TestResult::priority, Collectors.counting()));
 
System.out.println(failuresByPriority);     // {P0=1, P1=1}

The two-argument groupingBy(keyFn, downstream) lets you group and aggregate in one call. Collectors.counting(), Collectors.summingLong(...), Collectors.mapping(...) are the building blocks. They feel verbose at first; once you've replaced your fifth nested loop with one of them, you'll see why they exist.
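As a sketch of the other two building blocks, here are Collectors.summingLong and Collectors.mapping used as downstream collectors on the same six results. One assumption beyond the lesson: this version uses the three-argument groupingBy with TreeMap::new so the keys print in sorted order. The class name GroupingDemo is illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupingDemo {

    record TestResult(String name, String status, String priority, long durationMs) {}

    static final List<TestResult> RESULTS = List.of(
        new TestResult("Login",         "PASSED", "P0", 1450),
        new TestResult("Search",        "PASSED", "P2",  820),
        new TestResult("Checkout",      "FAILED", "P0", 3120),
        new TestResult("Logout",        "PASSED", "P2",  990),
        new TestResult("Export",        "FAILED", "P1", 4800),
        new TestResult("BillingReport", "PASSED", "P1", 2100)
    );

    // Group by priority, then sum the durations inside each group.
    static Map<String, Long> totalDurationByPriority() {
        return RESULTS.stream().collect(Collectors.groupingBy(
            TestResult::priority,
            TreeMap::new,   // assumption: sorted keys for predictable printing
            Collectors.summingLong(TestResult::durationMs)));
    }

    // Group by priority, then keep only the names inside each group.
    static Map<String, List<String>> namesByPriority() {
        return RESULTS.stream().collect(Collectors.groupingBy(
            TestResult::priority,
            TreeMap::new,
            Collectors.mapping(TestResult::name, Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(totalDurationByPriority());
        // {P0=4570, P1=6900, P2=1810}
        System.out.println(namesByPriority());
        // {P0=[Login, Checkout], P1=[Export, BillingReport], P2=[Search, Logout]}
    }
}
```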

Streams are lazy

Intermediate operations don't do any work until a terminal operation kicks off the pipeline. Notice that this peek only fires for elements the terminal operation actually consumes:

long count = Stream.of("a", "bb", "ccc", "dddd")
    .peek(s -> System.out.println("seen: " + s))
    .filter(s -> s.length() >= 2)
    .limit(2)
    .count();
 
System.out.println("count: " + count);

Output:

seen: a
seen: bb
seen: ccc
count: 2

Only three elements are walked — limit(2) short-circuits the pipeline once it has two matches. Imperative code can't get that for free; you'd have to add a counter and a break. Streams handle it because each operation is asked one element at a time, on demand.
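For reference, a sketch of the imperative version with the counter and the break spelled out. It records the same "seen" lines as the stream example; the class name ShortCircuitLoop is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class ShortCircuitLoop {

    // Returns the lines the loop would print, in order.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        String[] source = {"a", "bb", "ccc", "dddd"};

        int count = 0;
        for (String s : source) {
            log.add("seen: " + s);          // plays the role of peek
            if (s.length() >= 2) {          // plays the role of filter
                count++;
                if (count == 2) {           // plays the role of limit(2)
                    break;
                }
            }
        }
        log.add("count: " + count);
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The loop never reaches "dddd" either, but only because the counter and the break were written by hand.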

Streams vs loops — when to pick which

Streams shine when you're transforming data — filter, map, group, collect. Plain loops are still the better choice when:

  • The body is genuinely complex (multiple conditional branches, mutating multiple counters).
  • You need explicit early termination with a return from the enclosing method.
  • The collection is small enough that readability wins over expressiveness.

A single-pass stream of 6 results is not faster than a loop. It's not noticeably slower either. The choice is about clarity. If a pipeline reads like a sentence, use it; if it reads like a puzzle, use a loop.

A pipeline, step by step

Source: List<TestResult>

Six TestResult objects in memory. Calling .stream() returns a Stream<TestResult> view — no data is copied.

Five steps, one fluent expression: results.stream().filter(...).map(...).sorted().toList(). This is also the shape of every stream pipeline you'll write: the verbs change, but the rhythm of source → intermediate → intermediate → terminal stays the same.
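To make the stages visible, here is a sketch that gives each step of that expression its own variable, so you can see the stream type change along the way. You would not normally write it like this; the variable and class names are illustrative.

```java
import java.util.List;
import java.util.stream.Stream;

class PipelineSteps {

    record TestResult(String name, String status, String priority, long durationMs) {}

    static final List<TestResult> RESULTS = List.of(
        new TestResult("Login",    "PASSED", "P0", 1450),
        new TestResult("Export",   "FAILED", "P1", 4800),
        new TestResult("Checkout", "FAILED", "P0", 3120)
    );

    static List<String> stepByStep(List<TestResult> results) {
        // Source: a view over the list, no data copied.
        Stream<TestResult> source = results.stream();

        // Intermediate: still a Stream<TestResult>, nothing has run yet.
        Stream<TestResult> failures = source.filter(r -> r.status().equals("FAILED"));

        // Intermediate: the element type changes to String.
        Stream<String> names = failures.map(TestResult::name);

        // Intermediate: sorting is also deferred until the terminal call.
        Stream<String> sorted = names.sorted();

        // Terminal: this call runs the whole pipeline and builds the list.
        return sorted.toList();
    }

    public static void main(String[] args) {
        System.out.println(stepByStep(RESULTS));  // [Checkout, Export]
    }
}
```

Each intermediate variable can be used exactly once, which is one reason the fluent one-expression form is the usual way to write it.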

⚠️ Common mistakes

  • .toList() returns an unmodifiable list. Calling .add(...) on the result throws UnsupportedOperationException. If you need a mutable result, use collect(Collectors.toCollection(ArrayList::new)), which guarantees an ArrayList, or wrap the result with new ArrayList<>(stream.toList()). collect(Collectors.toList()) returns a mutable list in current implementations, but its documentation makes no guarantee about mutability.
  • Iterating a stream twice. A stream can be consumed only once. After a terminal operation the stream is spent, and calling another operation on it throws IllegalStateException. If you need two passes, build two streams from the source: list.stream().count() and list.stream().anyMatch(...).
  • Chaining forEach and expecting a result. list.stream().filter(...).forEach(...) runs the side effect; it doesn't return a list. To collect and print, split into two stages: var kept = list.stream().filter(...).toList(); kept.forEach(System.out::println);.
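The first two mistakes are easy to reproduce. This sketch triggers each exception on purpose and catches it; Collectors.toCollection(ArrayList::new) is used for the mutable copy, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamPitfalls {

    // Mistake 1: trying to modify the list returned by toList().
    static boolean toListIsUnmodifiable() {
        List<String> names = Stream.of("Login", "Search").toList();
        try {
            names.add("Checkout");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Fix: collect into an ArrayList when you need to add to the result later.
    static List<String> mutableCopy() {
        List<String> names = Stream.of("Login", "Search")
            .collect(Collectors.toCollection(ArrayList::new));
        names.add("Checkout");
        return names;
    }

    // Mistake 2: running a second terminal operation on the same stream.
    static boolean reuseThrows() {
        Stream<String> stream = Stream.of("Login", "Search");
        stream.count();                                  // first terminal operation
        try {
            stream.anyMatch(n -> n.startsWith("L"));     // second one fails
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(toListIsUnmodifiable());  // true
        System.out.println(mutableCopy());           // [Login, Search, Checkout]
        System.out.println(reuseThrows());           // true
    }
}
```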

🎯 Practice task

Build a real test report with streams. 30 minutes.

  1. Create StreamReport.java. Define record TestResult(String name, String status, String priority, long durationMs) {}.
  2. Build a List<TestResult> with at least 8 entries — mix priorities (P0, P1, P2), statuses (PASSED, FAILED, SKIPPED), and durations.
  3. Compute and print:
    • Failure count with .stream().filter(...).count().
    • Pass rate with (double) passed / total * 100. Use .filter(...).count() twice (or once + arithmetic).
    • Failure names sorted alphabetically with .filter(...).map(TestResult::name).sorted().toList().
    • Slowest test with .max(Comparator.comparingLong(TestResult::durationMs)).orElseThrow().
    • Average duration of P0 tests with .filter(r -> r.priority().equals("P0")).mapToLong(TestResult::durationMs).average().orElse(0).
  4. Use Collectors.groupingBy(TestResult::priority, Collectors.counting()) to print a count of tests per priority. Verify the totals add up to your list size.
  5. Use Collectors.groupingBy(TestResult::priority, Collectors.summingLong(TestResult::durationMs)) for a "total duration per priority" view.
  6. Use .allMatch(r -> r.priority().equals("P0") ? r.status().equals("PASSED") : true) (or a cleaner equivalent) to confirm "all P0 tests passed." Try changing one P0 to FAILED and re-running.
  7. Stretch: rewrite the slowest-test query with .sorted(Comparator.comparingLong(TestResult::durationMs).reversed()).findFirst(). Confirm both forms produce the same result. The .max(comparator) form is shorter; the sorted-then-findFirst form generalises to "top N" with a .limit(n).

That closes Chapter 8 — and the data-handling foundation of the course. Chapter 9 is the capstone: putting strings, regex, exceptions, file I/O, OOP, and streams together to build a real test data management utility.
