December 15, 2009

Ad hoc cell splitting re-post (original website down)

Information about cell-id splitting, stuck beacons, and failed IBSS merges!

From VillageTelco

The trouble with Madwifi and IBSS (ad-hoc) mode

While testing the Atheros Madwifi driver on the Ubiquiti NanoStation2 in the Freifunk mesh network, I ran into the problem that from time to time the card stopped working for approximately 30 seconds. I made VoIP calls with David Rowe using the NanoStation via the Freifunk network, but those irregular interruptions are annoying and would be a real show-stopper for the Villagetelco project. (If you depend on a chain of forwarding nodes to reach your gateway and they all randomly stop working for one minute out of every 15, there is always one that interrupts your call!)

This is the notorious Madwifi "stuck beacon" problem. It causes intermittent operation in IBSS mode and is triggered by MAC timer skews when cards try to perform an "IBSS merge". If the MAC timer shifts because of an attempted IBSS merge, a beacon in the transmit queue can get stuck: its transmission never gets triggered. Until the transmit queue is purged, the card stops transmitting altogether. One way to purge the transmit queue is to run a manual scan:
 iwlist ath0 scan 

Note that this helps only for a while until the next beacon gets stuck in the transmit buffer!

The emission of beacons in IBSS mode

Every ~100 ms (default settings) someone in the ad-hoc cell has to send a beacon to reveal the presence of the cell to other WiFi devices. There is a mechanism to prevent multiple nodes of the same ad-hoc cell from sending redundant beacons. After the last beacon has been sent, each WiFi card calculates a time that is 100 ms ± random jitter in the future and waits to see whether another ad-hoc node sends a beacon before its own timer reaches this value. If another node sends the beacon first, the card discards its own beacon. Otherwise it sends the beacon.
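
The beacon election described above can be sketched in a few lines of Python. This is only a toy model, not driver code: the jitter range of 2 ms, the node names and all function names are assumptions made up for illustration, and real cards do this in hardware.

```python
import random

BEACON_INTERVAL_US = 100_000  # ~100 ms default beacon interval


def next_beacon_time(last_beacon_us, jitter_us, rng):
    """Schedule one node's next beacon: interval plus/minus random jitter."""
    return last_beacon_us + BEACON_INTERVAL_US + rng.uniform(-jitter_us, jitter_us)


def elect_sender(scheduled):
    """Return (node, time) for the node whose timer fires first.

    Every other node hears that beacon and discards its own.
    """
    node = min(scheduled, key=scheduled.get)
    return node, scheduled[node]


def simulate(nodes, rounds, jitter_us=2_000, seed=1):
    """Run several beacon intervals and record which node sent each beacon."""
    rng = random.Random(seed)
    last_beacon_us = 0.0
    senders = []
    for _ in range(rounds):
        scheduled = {n: next_beacon_time(last_beacon_us, jitter_us, rng) for n in nodes}
        node, last_beacon_us = elect_sender(scheduled)
        senders.append(node)
    return senders


if __name__ == "__main__":
    print(simulate(["node-a", "node-b", "node-c"], rounds=10))
```

Because every node re-arms its timer relative to the last beacon it heard, the scheme only works if all nodes agree on the time. That is why a shifted MAC timer is so disruptive here.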

The "stuck beacon" occurs when the MAC clock in the card is shifted by an attempted "IBSS merge". In order to separate different wireless cells at the MAC level, cells identify themselves by transmitting an IBSS-ID (I simply call it the Cell-ID). In access point mode the Cell-ID is identical to the MAC address of the access point's WiFi card. In ad-hoc mode there is no master node. Hence the developers of 802.11 decided there must be a process for nodes to negotiate and merge the IBSS-ID.


The process of an "IBSS merge"

If you configure your WiFi card to operate in ad-hoc mode on channel 1 with the ESSID "village-telco-adhoc-mesh" and activate the interface, your card will listen on the channel to see if beacons are transmitted for that ESSID. If there are no beacons, your card generates a random Cell-ID and starts to send beacons containing the Cell-ID and timestamps taken from its own MAC timer. Now someone else does exactly the same with another WiFi card, out of range of the first card's transmissions. That card is configured to use the same ESSID and channel in ad-hoc mode. Because it doesn't receive any beacons for the ESSID after start-up, it also generates a new random Cell-ID and starts to send beacons.

Imagine that both cards are used in mobile devices that move around. At some point both cards receive each other's beacons. Since both cards use the same ESSID they should be in the same wireless cell. However, their Cell-IDs are different, so they belong to different wireless networks until they agree to merge to one Cell-ID and drop the other. In order to decide which Cell-ID to merge to, the cards compare their timestamps. The older cell wins, that is, the one whose MAC timer has been running longer and therefore carries the higher timestamp. So the card with the younger timestamp switches to the other Cell-ID and adjusts its MAC timer to the timestamp it has received in the beacon. Beacons issued by this card from now on will contain the older cell's timestamp.
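
As a rough illustration, the merge rule can be written down as follows. This is a simplified sketch and not how any particular driver implements it; the class name, the function name and the example Cell-IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdhocCard:
    essid: str
    cell_id: str
    tsf_us: int  # MAC timer value in microseconds


def receive_beacon(card, essid, cell_id, tsf_us):
    """Apply the merge rule to one received beacon.

    Returns True if the card adopted the sender's Cell-ID and timer.
    """
    if essid != card.essid:
        return False  # different network name: not our cell, ignore
    if tsf_us <= card.tsf_us:
        return False  # our cell is older: the sender is expected to merge to us
    # The sender's cell is older: adopt its Cell-ID and its timer value.
    card.cell_id = cell_id
    card.tsf_us = tsf_us
    return True


if __name__ == "__main__":
    old = AdhocCard("village-telco-adhoc-mesh", "02:11:22:33:44:55", 9_000_000)
    new = AdhocCard("village-telco-adhoc-mesh", "02:aa:bb:cc:dd:ee", 1_000_000)
    merged = receive_beacon(new, old.essid, old.cell_id, old.tsf_us)
    print(merged, new)
```

Note that in this model a card advertising a bogus, very high timestamp wins every comparison, no matter how long its cell has really been running.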

This way multiple wireless cells consisting of multiple ad-hoc nodes configured with the same channel and ESSID can merge into one big wireless cell. In Berlin the Freifunk community mesh network is a single wireless ad-hoc cell operating on channel 10 with the ESSID "olsr.freifunk.net". At roof level this single ad-hoc cell is nearly as big as the city of Berlin, and it consists of ~550 wireless interfaces on average - mostly routers, but also laptops and PCs.


The phenomenon of IBSS-ID cell splits

Ad-hoc mode has been widely neglected by chipset manufacturers and driver developers. Implementing ad-hoc mode is much more complicated than implementing station mode (access point client). As a matter of fact, if WiFi had started with ad-hoc mode as its only basic mode of operation, we wouldn't miss anything. Ad-hoc mode means that everyone in range can talk to everyone (multipoint to multipoint), while access point mode means that everyone can only talk to the access point and only the access point can talk to everyone (point to multipoint). Who needs a communication mode where all can only talk to one and via one, when there is a mode available where everyone can talk to everyone? The simple functions of an access point (DHCP, DNS relay) can also be performed by an ad-hoc station. There is only one reason why you shouldn't run an access point in ad-hoc mode: most devices that you want to connect to it have a buggy implementation of ad-hoc mode, so they either don't work at all or are unreliable. What is worse: even if most nodes in your network operate properly according to the 802.11 specs, it takes only a single buggy device to mess up your network!

A buggy device can do all kinds of nasty things, like:

  • Send wrong timestamps after merging to the Cell-ID.
  • Not merge at all.
  • Send the right timestamp but with the wrong Cell-ID.

False timestamps result in IBSS-ID cell splits, WiFi card lock-ups and intermittent operation. That is the situation when beacons with different timestamps and the same Cell-ID are in the air, so that the timestamps jump back and forth. Or you have different Cell-IDs that carry the same timestamp. Or the timestamp tells your card that the cell has been running for 500000 years, so the counter overflows and the MAC timer starts from zero.
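
For a sense of scale: the 802.11 timestamp (TSF) is a 64-bit counter of microseconds, so it wraps around only after roughly 584,000 years. A timestamp that claims an age of several hundred thousand years is therefore close to the overflow. The arithmetic, as a small Python sketch (the function names are made up for illustration):

```python
TSF_BITS = 64
TSF_MODULUS = 1 << TSF_BITS  # the counter wraps to zero at this value
US_PER_YEAR = 365.25 * 24 * 3600 * 1_000_000


def tsf_age_years(tsf_us):
    """Cell age in years implied by a TSF timestamp in microseconds."""
    return tsf_us / US_PER_YEAR


def tsf_advance(tsf_us, delta_us):
    """Advance a TSF value, wrapping around like the 64-bit counter does."""
    return (tsf_us + delta_us) % TSF_MODULUS


if __name__ == "__main__":
    print(round(tsf_age_years(TSF_MODULUS)))      # years until the counter wraps
    print(tsf_advance(TSF_MODULUS - 1, 1))        # one tick past the maximum
```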

At Freifunk we have learned that in such an environment you have to use some tricks to get by. Otherwise your wireless card (or the whole operating system) will crash, or you end up with multiple little ad-hoc cells that don't talk to each other instead of a single one.

I have compiled a recent version of the iwl3945 driver for an Intel 802.11abg card with verbose debugging options enabled. This is what dmesg tells me when I try to connect to the Freifunk mesh:

[ 1655.102233] RX beacon SA=00:0b:6b:20:22:fe BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81ca98181 diff=116719550365311 @338775
[ 1655.102241] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
[ 1655.102249] phy1: Removed STA 00:0b:6b:20:22:fe
[ 1655.102254] phy1: Removed STA 00:80:48:52:ff:9e
[ 1655.102260] phy1: Removed STA 00:14:bf:3d:4d:12
[ 1655.103464] phy1: Adding new IBSS station 00:80:48:52:ff:9e (dev=eth1)
[ 1655.103468] phy1: Allocated STA 00:80:48:52:ff:9e
[ 1655.103472] phy1: Inserted STA 00:80:48:52:ff:9e
[ 1655.104592] phy1: Adding new IBSS station 00:14:bf:3d:4d:12 (dev=eth1)
[ 1655.104604] phy1: Allocated STA 00:14:bf:3d:4d:12
[ 1655.104610] phy1: Inserted STA 00:14:bf:3d:4d:12
[ 1655.105331] phy1: HW CONFIG: freq=2457
[ 1655.105821] phy1: Adding new IBSS station 00:0b:6b:20:22:fe (dev=eth1)
[ 1655.105826] phy1: Allocated STA 00:0b:6b:20:22:fe
[ 1655.105829] phy1: Inserted STA 00:0b:6b:20:22:fe
[ 1655.105847] RX beacon SA=00:80:48:52:ff:9e BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81ca981ed diff=116719550365203 @338776
[ 1655.105852] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
[ 1655.105856] phy1: Removed STA 00:0b:6b:20:22:fe
[ 1655.105859] phy1: Removed STA 00:14:bf:3d:4d:12
[ 1655.105861] phy1: Removed STA 00:80:48:52:ff:9e
[ 1655.106818] phy1: HW CONFIG: freq=2457
[ 1655.107303] phy1: Adding new IBSS station 00:80:48:52:ff:9e (dev=eth1)
[ 1655.107308] phy1: Allocated STA 00:80:48:52:ff:9e
[ 1655.107312] phy1: Inserted STA 00:80:48:52:ff:9e
[ 1655.107326] RX beacon SA=00:14:bf:3d:4d:12 BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81ca9d368 diff=116719550344344 @338776
[ 1655.107330] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
[ 1655.107334] phy1: Removed STA 00:80:48:52:ff:9e
[ 1655.108068] phy1: Destroyed STA 00:0b:6b:20:22:fe
[ 1655.108293] phy1: HW CONFIG: freq=2457
[ 1655.108772] phy1: Adding new IBSS station 00:14:bf:3d:4d:12 (dev=eth1)
[ 1655.108778] phy1: Allocated STA 00:14:bf:3d:4d:12
[ 1655.108782] phy1: Inserted STA 00:14:bf:3d:4d:12
[ 1655.118298] phy1: Destroyed STA 00:80:48:52:ff:9e
[ 1655.120046] phy1: Destroyed STA 00:14:bf:3d:4d:12
[ 1655.121053] phy1: Destroyed STA 00:0b:6b:20:22:fe
[ 1655.122713] phy1: Destroyed STA 00:14:bf:3d:4d:12
[ 1655.124048] phy1: Destroyed STA 00:80:48:52:ff:9e
[ 1655.124865] phy1: Destroyed STA 00:80:48:52:ff:9e
[ 1655.156680] phy1: Adding new IBSS station 00:80:48:52:ff:9e (dev=eth1)
[ 1655.156693] phy1: Allocated STA 00:80:48:52:ff:9e
[ 1655.156699] phy1: Inserted STA 00:80:48:52:ff:9e
[ 1655.205880] RX beacon SA=00:80:48:52:ff:9e BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81cab11ef diff=116719550262801 @338801
[ 1655.205893] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
[ 1655.205901] phy1: Removed STA 00:80:48:52:ff:9e
[ 1655.205906] phy1: Removed STA 00:14:bf:3d:4d:12
[ 1655.207037] phy1: Destroyed STA 00:80:48:52:ff:9e
[ 1655.207316] phy1: Adding new IBSS station 00:14:bf:3d:4d:12 (dev=eth1)
[ 1655.207324] phy1: Allocated STA 00:14:bf:3d:4d:12
[ 1655.207330] phy1: Inserted STA 00:14:bf:3d:4d:12
[ 1655.207965] phy1: HW CONFIG: freq=2457
[ 1655.208477] phy1: Adding new IBSS station 00:80:48:52:ff:9e (dev=eth1)
[ 1655.208482] phy1: Allocated STA 00:80:48:52:ff:9e
[ 1655.208485] phy1: Inserted STA 00:80:48:52:ff:9e
[ 1655.208504] RX beacon SA=00:14:bf:3d:4d:12 BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81cab64ba diff=116719550241606 @338802
[ 1655.208509] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be

As you can see, the card merges, disassociates and merges again five times within 0.106 seconds with a wireless ad-hoc cell it already merged with long ago. Every time it receives another false timestamp it assumes that it is merging to a new IBSS-ID. The driver purges its MAC table of known stations and waits to populate it again. No wonder I don't get more than 2-3 ICMP messages through.
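
If you want to check such a log yourself, a few lines of Python are enough to pull out the beacon and merge lines. The sketch below embeds three lines copied from the log above; the regular expressions and function names are my own assumptions about the line format, not part of any driver tool. The numbers in the log are consistent with the diff field being the local TSF minus the beacon timestamp, taken modulo 2^64.

```python
import re

BEACON_RE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)\] RX beacon SA=(?P<sa>\S+) BSSID=(?P<bssid>\S+) "
    r"TSF=(?P<tsf>0x[0-9a-f]+) BCN=(?P<bcn>0x[0-9a-f]+) diff=(?P<diff>-?\d+)"
)
MERGE_RE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)\] \S+ beacon TSF higher than local TSF - IBSS merge"
)

SAMPLE = """\
[ 1655.102233] RX beacon SA=00:0b:6b:20:22:fe BSSID=02:ca:ff:ee:ba:be TSF=0x0 BCN=0xffff95d81ca98181 diff=116719550365311 @338775
[ 1655.102241] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
[ 1655.208509] eth1: beacon TSF higher than local TSF - IBSS merge with BSSID 02:ca:ff:ee:ba:be
"""


def parse_beacons(text):
    """Extract the fields of every 'RX beacon' line."""
    return [
        {
            "time": float(m["t"]),
            "sa": m["sa"],
            "bssid": m["bssid"],
            "tsf": int(m["tsf"], 16),
            "bcn": int(m["bcn"], 16),
            "diff": int(m["diff"]),
        }
        for m in BEACON_RE.finditer(text)
    ]


def merge_times(text):
    """Kernel timestamps (seconds) of every 'IBSS merge' line."""
    return [float(m["t"]) for m in MERGE_RE.finditer(text)]


def wrapped_diff(tsf, bcn):
    """Local TSF minus beacon timestamp, modulo 2**64."""
    return (tsf - bcn) % (1 << 64)


if __name__ == "__main__":
    for beacon in parse_beacons(SAMPLE):
        print(beacon["sa"], wrapped_diff(beacon["tsf"], beacon["bcn"]) == beacon["diff"])
    times = merge_times(SAMPLE)
    print(len(times), "merges within", round(times[-1] - times[0], 3), "seconds")
```

To use it on a real capture, replace SAMPLE with the saved dmesg output.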

The first trick is a non-standard hack: we fix the Cell-ID rather than letting the cards negotiate one. We have modified the Madwifi driver and tricked the Broadcom driver into ignoring any attempt to change the Cell-ID. "My Cell-ID is 02:CA:FF:EE:BA:BE - period." You can set the Cell-ID with the command:

 iwconfig ath0 ap 02:CA:FF:EE:BA:BE

Instead of 02:CA:FF:EE:BA:BE you can use whatever you like best as a Cell-ID that is easy to remember.
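
When picking your own Cell-ID it seems sensible to stay in the address range that 802.11 uses for self-generated IBSS cell IDs: a unicast address with the "locally administered" bit set, so it cannot clash with a vendor-assigned MAC address. 02:CA:FF:EE:BA:BE has that form. A small Python check, with a hypothetical function name; whether a given driver accepts other values is not something this sketch can tell you:

```python
import re

MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def is_plausible_cell_id(mac):
    """True for a well-formed, unicast, locally administered MAC address."""
    if not MAC_RE.match(mac):
        return False
    first_octet = int(mac[:2], 16)
    is_unicast = (first_octet & 0x01) == 0  # group/multicast bit must be clear
    is_local = (first_octet & 0x02) != 0    # locally administered bit must be set
    return is_unicast and is_local


if __name__ == "__main__":
    for candidate in ("02:CA:FF:EE:BA:BE", "00:0b:6b:20:22:fe"):
        print(candidate, is_plausible_cell_id(candidate))
```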

Even so, the Atheros cards currently still try to synchronize their MAC timers, which results in race conditions in the software and in the stuck beacon problem. There is now a workaround which will be added to Madwifi in OpenWrt and the Villagetelco firmware. With a trick we make the cards ignore all attempts to shift the MAC timer: we simply tell the hardware of the WiFi card that it is in access point mode, which stops the card from trying to synchronize the MAC timer. Actually we don't need to synchronize the MAC timers in hardware at all - this is better done in software, in the driver running on the host CPU of your PC.

Problem solved!

The problems with stuck beacons and race conditions triggered by MAC timer skews and attempted IBSS merges are now fixed in the OpenWrt Kamikaze development trunk, in Kamikaze 8.09_RC1 and in our Villagetelco development repository. Both IBSS mode and Pseudo-IBSS mode (a.k.a. Ad-hoc demo mode, or Ahdemo mode for short) are working fine now. I ran the DIR-300 for several days in the Freifunk mesh cloud without any issues.

The trick is to load the binary HAL for access point mode upon initialization of the WiFi interface, rather than the HAL for IBSS mode. In access point mode the card does not try to synchronize its MAC clock with any other wireless device in the cell, hence this functionality is missing in the HAL. In IBSS mode ('real' ad-hoc mode) the synchronization is now done in software on the host system if you create the IBSS VAP (Virtual Access Point) with the option nosbeacon:

 wlanconfig ath0 create wlandev wifi0 wlanmode adhoc nosbeacon 

Erroneous TSF timestamps received with beacons don't cause race conditions anymore. Pseudo-IBSS mode doesn't depend on sending beacons anyway, although there have been issues with this mode before.

Both modes have been tested in the Freifunk mesh cloud in Berlin without any stability problems for 72 hours each.

There is one problem, however, with TSF timestamps generated on the host CPU: the software-generated timestamps are not as precise as the timestamps from the card's MAC clock, and they never will be. The communication between the Madwifi driver in the Linux kernel and the WiFi card is not real-time, so there will always be varying lags. It has been observed that these small TSF deviations confused devices with Broadcom chipsets, such as the Linksys WRT54GL: they kept working but stopped sending beacons.

Apart from that, Atheros devices with Madwifi operating in Ahdemo mode and Broadcom-based devices with their closed-source driver operating in 'real' IBSS mode work together nicely in the ~500 node community mesh that we use here in Berlin.

Personal comment: I have replaced the WiFi card in my Asus Eee PC 901 with an Atheros 802.11abg card and use it in the mesh about 15 hours every day - and I'm completely happy with it. I'm using the Madwifi driver sources from our Villagetelco repository on my PC.

'Real' Ad-Hoc (IBSS) mode versus Ah-Demo (Pseudo-IBSS) mode

As explained earlier on this page, the 'Real' Ad-Hoc (IBSS) mode is complex and not usable in a large-scale mesh ad-hoc cell. However, there has always been the non-standard Ahdemo mode, which is not widely known. Ahdemo mode was (and maybe still is) popular amongst people setting up wireless long shots. The first world records for wireless long shots were achieved with Lucent/Orinoco 802.11b PCMCIA cards.

Ahdemo mode is supported only by a few chipsets and drivers, namely old Lucent Orinoco, Intersil Prism chipsets of generation 2, 2.5 and 3 (all 802.11b only) and Madwifi. However, it is not guaranteed that the aforementioned chipsets and their respective drivers will interoperate with each other in Ahdemo mode.

In Ahdemo mode cards don't send any beacons, and hence there are no IBSS merges. An Ahdemo cell doesn't reveal its presence to other wireless networks, because this detection works by receiving beacons. Usually the Cell-ID of Ahdemo cells is 00:00:00:00:00:00, but the Madwifi driver can use a fixed Cell-ID, configured with the same iwconfig option shown earlier on this page:

 iwconfig ath0 ap 02:CA:FF:EE:BA:BE 

Atheros cards with Madwifi driver operating in Ahdemo mode and Atheros cards with Madwifi driver operating in real Adhoc mode have no problems communicating with each other, as long as all cards use the same fixed Cell-ID.

When competing with another wireless network operating/colliding on the same channel, Madwifi Ahdemo mode loses more throughput than Madwifi Ad-Hoc mode. If a Madwifi Ahdemo cell doesn't have to compete for the channel with another network, Ahdemo performance is slightly better than in Ad-Hoc mode. The reason is that Ad-Hoc beacons are sent at the basic (slowest) rate and hence consume airtime, which reduces the channel capacity.

There is another advantage of Ahdemo: it is possible to configure a Madwifi interface to operate as an access point VAP and an Ahdemo VAP at the same time. The RO.B.I.N. and Nightwing firmware for Atheros AP51 based devices use this capability. They use the Ahdemo VAP to create the mesh as a backbone and the access point VAP to give client access. This is an interesting concept but not in the scope of the Villagetelco project, since we are using the mesh only for telephony.
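
A back-of-the-envelope estimate shows why the gain from dropping beacons is only slight. All numbers below are assumptions chosen for illustration, not measurements: a beacon frame of 100 bytes, a basic rate of 1 Mbit/s, the 192 µs long 802.11b preamble and header, and exactly one beacon per 100 ms interval.

```python
def beacon_airtime_us(frame_bytes, rate_mbps, preamble_us):
    """Airtime of one beacon in microseconds (bits / Mbit/s = microseconds)."""
    return preamble_us + frame_bytes * 8 / rate_mbps


def beacon_airtime_share(frame_bytes=100, rate_mbps=1.0, preamble_us=192,
                         interval_us=100_000):
    """Fraction of the channel time occupied by one beacon per interval."""
    return beacon_airtime_us(frame_bytes, rate_mbps, preamble_us) / interval_us


if __name__ == "__main__":
    print(f"{beacon_airtime_share():.2%} of the airtime goes to beacons")
```

Under these assumptions the beacons of one cell cost about one percent of the airtime. Larger beacon frames or several overlapping cells on the same channel would raise that share.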

Example: how to set up Madwifi VAPs

Here is a brief instruction on how to set up three VAPs (1 Ahdemo, 1 Master, 1 Monitor). A monitor VAP is required for advanced 802.11 traffic analysis.


 wlanconfig ath0 destroy    # only if a VAP instance is already running
 wlanconfig ath0 create wlandev wifi0 wlanmode ahdemo
 wlanconfig ath1 create wlandev wifi0 wlanmode ap
 wlanconfig ath2 create wlandev wifi0 wlanmode monitor

Recommended tools for 802.11 MAC analysis

You can use the monitor VAP to receive raw 802.11 packets, which will include 802.11 MAC, Radiotap or Prism headers for advanced 802.11 analysis. The Rolls-Royce of traffic monitoring software is Wireshark. The Porsche for mesh analysis is Horst [1]. Horst was designed to debug MAC problems of WiFi mesh networks. Also Kismet [2] should be mentioned at this point, of course.

Stay tuned!

Cheers elektra

1 comment:

  1. Thank you for your great information. I solved my "Cell Split" problem by:
    iwconfig ath0 ap xx:xx:xx:xx:xx:xx
    after ad-hoc configuration.
    Thank you very much.
