Apache Kafka – general/notes

This is not a tutorial about Apache Kafka – just notes about setting up a few brokers, producing and consuming from a topic.

High throughput is achieved by scaling horizontally (increasing the number of brokers). LinkedIn: 14000 brokers -> 2 petabytes per week.

Cluster – a Kafka cluster is a grouping of multiple brokers (it doesn’t matter whether they run on a single machine or on multiple machines)

What matters is how independent brokers are grouped in order to form
a cluster. The grouping mechanism is an important part of Kafka’s architecture and allows it to scale -> this is where Apache ZooKeeper comes in.

A system of nodes requires coordination to ensure consistency and
progress toward a common goal. Proper coordination avoids duplicated work. The nodes communicate with each other through messages.

In Kafka there is a Controller that is elected (there are different techniques).
Its responsibilities are:
— to maintain a list of workers
— to maintain a list of work items
— to maintain the status of active tasks and workers

In Kafka the team is the Cluster, its members are the brokers, and
they decide which broker becomes the Controller of the cluster.


The controller should know who is available and healthy enough to take on work, and it should also know the risk policies.
Task redundancy – it has to ensure that work assigned to a worker is completed and not lost if that worker dies. This means a task given to one worker should also be given to other workers in case of an eventual catastrophe.

After determining redundancy, it will assign the work to a leader, and the leader will take care of assigning tasks to workers.

After the task is assigned, a quorum is formed and the workers vote (becoming followers); the controller assigns the task to the Leader that was elected.

In Apache Kafka, the work performed is receiving messages, categorizing them into topics, and reliably persisting them.

The effort required to receive messages from producers is substantially less than the effort required to serve those messages to consumers.

Replicas – the number of replicas for a message published to a topic

Partition – the number of partitions

Leader – the leader broker for that partition

Isr – in-sync replicas; if replicas != isr, the partition is not in a healthy state
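The replicas-vs-ISR health check can be sketched in plain Python (illustrative only, not the Kafka admin API – the `PartitionStatus` type is invented for this note):

```python
# Illustrative sketch (not the Kafka API): a partition is healthy only when
# every assigned replica is in the in-sync replica set (replicas == isr).
from dataclasses import dataclass

@dataclass
class PartitionStatus:
    topic: str
    partition: int
    replicas: list   # broker ids that hold a copy of the partition
    isr: list        # broker ids currently in sync with the leader

    def is_healthy(self) -> bool:
        # Any replica that fell out of the ISR means under-replication.
        return set(self.replicas) == set(self.isr)

healthy = PartitionStatus("my_topic", 0, replicas=[1, 2, 3], isr=[1, 2, 3])
lagging = PartitionStatus("my_topic", 1, replicas=[1, 2, 3], isr=[1, 3])

print(healthy.is_healthy())  # True
print(lagging.is_healthy())  # False
```

The same comparison is what you would do by eye on the output of a topic `--describe`: whenever the Isr column is shorter than the Replicas column, the partition is under-replicated.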

Distributed Systems: Communication and Consensus (with Apache ZooKeeper). Gossip protocol.

Events related to brokers becoming available. Retrieving configuration and being notified about new configuration.
– Leader election
– Health status

Apache ZooKeeper – a centralized service for maintaining metadata about a cluster of distributed nodes.

  • Configuration information
  • Health status
  • Group membership

Used by: Hadoop, HBase, Apache Mesos, Solr, Redis and Neo4j

ZooKeeper is a distributed system consisting of multiple nodes in an “ensemble”.

At the heart of Apache Kafka you have a Cluster; ZooKeeper provides the metadata required by the cluster’s brokers in order to scale and work reliably.

Understanding Topics, Partitions and Brokers

A topic is a named feed or category of messages

  • Producers produce to a topic
  • Consumers consume from a topic

Logical entity – Physically represented as a log

Logical Topic:

  • A topic can span all brokers (or a few of them) – it’s not our concern
  • Producer -> sends to topic -> events are immutable ->
    -> if a message is incorrect, it’s the consumer’s concern to deal with it
  • An architectural style or approach to maintaining an application’s state by capturing all changes as a sequence of time-ordered, immutable events.
  • Message content. Each message has:
    — Timestamp
    — Referenceable identifier
    — Payload (binary)
  • Each consumer should maintain its exclusive operation boundaries: if a consumer has a bug and fails to consume a message, this should not affect other consumers. Consumers maintain their autonomy with the “offset”. It’s a placeholder, like a bookmark.
  • Last-read message position – fully maintained by the Kafka consumer:
    it tracks what it has read and what it has not read
  • Corresponds to the message identifier. When a consumer wants to read from a topic it establishes a connection and says from where it wants to read messages (e.g. “from beginning”). Behind the scenes: {offset: 0}.
  • Another consumer can be at a different message in the topic. It’s possible to return to a previous event, or to read only new messages (“from last offset”). When a new message arrives, consumers receive an event saying so and can advance their position.
  • Message retention policy – slow consumers: Apache Kafka retains all published messages regardless of consumption. The retention period is configurable in hours.
  • Default is 168 hours (seven days) – the oldest messages expire first
  • Retention period is defined on a per-topic basis
  • Physical storage is a constraint for retention policy
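The offset-as-bookmark idea above can be sketched in plain Python (illustrative, not the Kafka client API): each consumer keeps its own position into a shared, append-only log, so a slow or stuck consumer never affects another one.

```python
# Illustrative sketch, not the Kafka consumer API: each consumer tracks its
# own offset ("bookmark") into a shared, immutable message log.
class Topic:
    def __init__(self):
        self.log = []                   # append-only list of messages

    def append(self, message):
        self.log.append(message)

class Consumer:
    def __init__(self, topic, from_beginning=True):
        self.topic = topic
        # "from beginning" is just {offset: 0}; otherwise start at the end.
        self.offset = 0 if from_beginning else len(topic.log)

    def poll(self):
        # Return unread messages and advance this consumer's bookmark only.
        unread = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return unread

topic = Topic()
for m in ["a", "b", "c"]:
    topic.append(m)

fast = Consumer(topic)
slow = Consumer(topic)
print(fast.poll())                      # ['a', 'b', 'c']
topic.append("d")
print(fast.poll())                      # ['d']
print(slow.poll())                      # ['a', 'b', 'c', 'd'] – unaffected by fast
```

Note that polling mutates only the consumer’s own offset, never the log – that is what makes rewinding to a previous event possible.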

Demo: Starting Apache Kafka and Producing and Consuming Messages

# start Kafka:
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --topic my_topic --zookeeper localhost:2181 --replication-factor 1 --partitions 1
ll /tmp/kafka-logs
bin/kafka-topics.sh --list --zookeeper localhost:2181

Produce/consume some basic messages:
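The stock console clients can do this from the terminal (assuming a broker on localhost:9092 and ZooKeeper on localhost:2181, matching the commands above; these flags are from the classic ZooKeeper-based tooling):

```shell
# Type messages, one per line; each line is published to the topic.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic

# In another terminal, read everything from offset 0 onwards.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic my_topic --from-beginning
```

Run the producer and consumer side by side: lines typed into the producer appear in the consumer as they arrive.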

Transaction or Commit Logs

Transaction log – the source of truth
Physically stored and maintained
Higher-order data structures derive from the log

Point of recovery – the log file: you replay from the log. It is the basis for replication and distribution – redundancy, fault tolerance, and distribution
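The “replay from the log” recovery idea can be sketched as rebuilding derived state from an append-only event log (illustrative only, not Kafka internals):

```python
# Illustrative sketch: higher-order state (here a key-value view) is derived
# by replaying an append-only commit log in time order.
log = [
    ("set", "balance", 100),
    ("set", "owner", "alice"),
    ("set", "balance", 80),
]

def replay(log):
    state = {}
    for op, key, value in log:          # apply every event, oldest first
        if op == "set":
            state[key] = value
    return state

# After a crash, the current state is fully recoverable from the log alone.
print(replay(log))                      # {'balance': 80, 'owner': 'alice'}
```

Because the log is the source of truth, shipping the same log to another node and replaying it there yields an identical replica – which is exactly why the log is the basis for replication.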

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Kafka partitions
How a logical concept is implemented in the physical world.
Each topic has one or more partitions; the number of partitions
depends on usage circumstances.
A partition is the basis on which Kafka can:

  • Scale
  • Become fault-tolerant
  • Achieve higher levels of throughput

Each partition is maintained on one or more brokers.

A single partition is limited by the CPU, memory, and storage of the machine that hosts it – each partition must fit entirely on one machine. With only one partition you cannot scale out; for heavy use, scale out across more partitions.

In general, the scalability of Apache Kafka is determined by the number of partitions being managed by multiple broker nodes.

Suppose a topic with 3 partitions – a single log split into 3 log files, ideally managed by 3 brokers; the partitions are mutually exclusive. This allows increased parallelism.

How Kafka splits messages across the different partitions is based on a partitioning scheme that can be set by the producer – the default is round-robin.
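A toy version of a producer-side partitioning scheme (illustrative only – the real Kafka Java producer uses a murmur2 hash for keyed messages): keyless messages go round-robin, while keyed messages hash deterministically so equal keys always land in the same partition.

```python
import itertools

# Illustrative partitioning sketch (not Kafka's actual partitioner):
# keyless messages rotate round-robin; keyed messages use a stable hash
# so the same key always maps to the same partition.
class Partitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._rr = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            return next(self._rr)       # default scheme: round-robin
        # Sum of key bytes as a stable (if naive) hash.
        return sum(key.encode()) % self.num_partitions

p = Partitioner(3)
print([p.partition() for _ in range(4)])                 # [0, 1, 2, 0]
print(p.partition("user-42") == p.partition("user-42"))  # True
```

Keyed partitioning matters for ordering: Kafka only guarantees order within a partition, so routing all messages for one key to one partition preserves their relative order.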
