HACMP routes after a switch failure
Topic: HACMP routes after a switch failure (Read 6577 times)
Michael
Administrator
Hero Member
Posts: 676
Re: HACMP routes after a switch failure
«
Reply #10 on:
July 21, 2009, 01:11:18 PM »
on vacation. I'll look into it later.
sbzx
Jr. Member
Posts: 13
Re: HACMP routes after a switch failure
«
Reply #9 on:
July 16, 2009, 11:14:08 AM »
Michael,
I don't think there is any problem with the highlighted addresses. It's how topology tells us that 192.168.160.46 may be configured on either node.
My config looks like this:
NODE mqs3:
    Network diskhbnet_35
        mqs3_hdisk6_01  /dev/hdisk6
    Network serviceNet
        mqs3    10.23.24.156
        mqs5    10.23.24.147
        mqs3b1  192.168.2.1
        mqs3b2  192.168.3.1
NODE mqs5:
    Network diskhbnet_35
        mqs5_hdisk0_01  /dev/hdisk0
    Network serviceNet
        mqs3    10.23.24.156
        mqs5    10.23.24.147
        mqs5b2  192.168.3.3
        mqs5b1  192.168.2.3
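The point about topology listing a movable address under each node can be checked mechanically. The following is a minimal Python sketch, not an HACMP tool: it assumes the plain-text layout of the cltopinfo output shown in this thread ("NODE name:" lines followed by "label address" pairs), and the function names `labels_by_node` and `shared_labels` are made up for illustration. It reports every label/address pair that appears under more than one node.

```python
from collections import defaultdict

# Sample taken from the configuration posted above.
SAMPLE = """\
NODE mqs3:
Network diskhbnet_35
mqs3_hdisk6_01 /dev/hdisk6
Network serviceNet
mqs3 10.23.24.156
mqs5 10.23.24.147
mqs3b1 192.168.2.1
mqs3b2 192.168.3.1
NODE mqs5:
Network diskhbnet_35
mqs5_hdisk0_01 /dev/hdisk0
Network serviceNet
mqs3 10.23.24.156
mqs5 10.23.24.147
mqs5b2 192.168.3.3
mqs5b1 192.168.2.3
"""


def labels_by_node(cltopinfo_text):
    """Map each (label, IPv4 address) pair to the set of nodes listing it."""
    nodes_for_label = defaultdict(set)
    node = None
    for raw in cltopinfo_text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] == "NODE" and len(parts) == 2:
            node = parts[1].rstrip(":")
        elif node and len(parts) == 2 and parts[1].count(".") == 3:
            # "label address" line; disk heartbeat lines (/dev/hdiskN)
            # and "Network ..." lines do not match and are skipped.
            nodes_for_label[(parts[0], parts[1])].add(node)
    return nodes_for_label


def shared_labels(cltopinfo_text):
    """Return the label/address pairs listed under more than one node."""
    found = labels_by_node(cltopinfo_text)
    return sorted(pair for pair, nodes in found.items() if len(nodes) > 1)


print(shared_labels(SAMPLE))
```

On this sample the two serviceNet labels (mqs3 and mqs5) show up under both nodes, while the boot labels are listed once each. Reading "listed under both nodes" as "may be hosted by either node" is the interpretation given in this reply, not something the sketch can confirm.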
Michael
Administrator
Hero Member
Posts: 676
Re: HACMP routes after a switch failure
«
Reply #8 on:
April 16, 2009, 06:59:08 AM »
Quote from: Battenoff on March 30, 2009, 11:26:09 AM
root@primary:/ > /usr/es/sbin/cluster/utilities/cltopinfo
Cluster Name: ora_prd_cluster
Cluster Connection Authentication Mode: Standard
Cluster Message Authentication Mode: None
Cluster Message Encryption: None
Use Persistent Labels for Communication: No
There are 2 node(s) and 2 network(s) defined
NODE primary:
    Network net_diskhb_01
        primary_hdisk5_01  /dev/hdisk5
    Network net_ether_01
        production   192.168.160.46
        primary_if2  10.0.2.1
        primary_if1  10.0.1.1
NODE secondary:
    Network net_diskhb_01
        secondary_hdisk5_01  /dev/hdisk5
    Network net_ether_01
        production     192.168.160.46
        secondary_if2  10.0.2.2
        secondary_if1  10.0.1.2
Note the highlighted lines: it would appear you are using the 192.168.160.46 address on both nodes at the same time. This is going to give problems. I was blind before I guess.
«
Last Edit: April 16, 2009, 07:04:40 AM by Michael
»
Battenoff
New Member
Posts: 4
Re: HACMP routes after a switch failure
«
Reply #7 on:
April 07, 2009, 10:16:37 AM »
The system is configured with two cards, and each card goes through a different switch. It was when one of the switches failed and hacmp failed the address over to the other card that we noticed the problem. What happened was that MOST, but not all, traffic flowed as it should. Only traffic destined for the local network failed. I have been trying to remember to dig out the routing tables from the time the machine was running incorrectly (the ones posted are from when the machine is doing what it is supposed to).
The problem is probably easy to describe because it doesn't need a switch failure; pulling the cable achieves the same result. We have network traffic using en0. I pull the cable and hacmp fails over to en1. Everything keeps working. I then plug en0 back in and wait a few minutes. I then unplug en1 and again the failover works as expected, BUT this time only non-local network traffic goes through en0. The local network traffic gets divided between en1 and en0. It's only when we change the status of en1 to down that traffic is routed correctly.
First I thought it was the distribution preference, but that didn't change anything. I then thought it was dead gateway detection. I tried both passive and active, neither fixed the problem.
When I get back to work tomorrow, I will try and find a copy of the routing table when a card was disconnected.
Bernie
Michael
Administrator
Hero Member
Posts: 676
Re: HACMP routes after a switch failure
«
Reply #6 on:
April 07, 2009, 08:20:53 AM »
Well, let's think a bit about switch failure as well.
You have dual NICs, dual switches, etc. If you look at the way everything is cabled, is there perhaps an unintended SPOF (single point of failure)?
I know of a case where some clients could communicate, but others could not because there was a switch interconnect failure. HACMP got the service interfaces into the same switch - but since there was no connectivity to the other switch some clients got lost.
The cluster saw itself as correct - but not all the clients agreed. So, my next global view would be to inspect the infrastructure for side-effects. HACMP cannot see outside of the node/cluster.
John R Peck
Administrator
Senior Member
Posts: 129
Re: HACMP routes after a switch failure
«
Reply #5 on:
April 02, 2009, 03:00:38 AM »
Are the command outputs you show from when you are having the problem or from when it's working?
In those outputs, each of your NIC physical ports appears to be in a different logical network, by virtue of the subnet masks, so there won't be any confusion from the AIX end as to which port to go outbound on. (The other post you refer to was where I wondered if NICs were in the same logical network and packets can then only go out on the first interface listed in the routing table for that logical network - not the problem here.)
As we're not seeing anything wrong there, maybe the problem is on the client end or in the router/switch config?
If you have path MTU discovery turned on in the no (network options) settings, that will cause routes to appear in the netstat -rn output for the discovered paths, although I don't see why that should be a problem in HA. We need more routing table outputs, I think.
If the problem is, for whatever reason, that you have routes which are set pointing down the dead interface, maybe you can add something to an HA event script to remove them when an interface dies, or flush the route table, or something along those lines?
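As a starting point for that kind of check, here is a minimal Python sketch that lists the routes bound to one interface. It only parses text in the column layout of the netstat -nr output posted in this thread (Destination, Gateway, Flags, Refs, Use, If, ...). The function name `routes_on_interface` is made up for illustration, and what to do with the resulting list (delete, flush, or just log) is deliberately left open.

```python
# A few rows copied from the netstat -nr output posted in this thread.
SAMPLE = """\
Destination Gateway Flags Refs Use If Exp Groups
Route Tree for Protocol Family 2 (Internet):
default 192.168.160.234 UG 26 5862416 en1 - -
10.0.1.0 10.0.1.1 UHSb 0 0 en0 - - =>
10.0.1/24 10.0.1.1 U 2 301844 en0 - -
10.0.1.1 127.0.0.1 UGHS 2 53424 lo0 - -
10.0.1.255 10.0.1.1 UHSb 0 4 en0 - -
192.168.160/24 192.168.160.47 U 1 241283 en1 - -
"""


def routes_on_interface(netstat_rn_text, ifname):
    """Return (destination, gateway, flags) for every route using ifname."""
    routes = []
    for line in netstat_rn_text.splitlines():
        fields = line.split()
        # Data rows carry the interface name in the sixth column; the
        # header and "Route Tree" lines never match a real interface name.
        if len(fields) >= 6 and fields[5] == ifname:
            routes.append((fields[0], fields[1], fields[2]))
    return routes


for route in routes_on_interface(SAMPLE, "en1"):
    print(route)
```

Run against a table captured while the problem is present, the same function would show whether anything for the local subnet still points at the interface that is supposed to be dead.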
Battenoff
New Member
Posts: 4
Re: HACMP routes after a switch failure
«
Reply #4 on:
April 01, 2009, 02:49:06 AM »
Hi Michael,
Thanks for your thoughts.
The application is an Oracle database and it is used by clients on different subnets, so it uses the default gateway. In our current setup, the rman recovery database is on the other node in the cluster, which is on the same network. When a switch failure occurred and we got the fail over, end users didn't notice because the database was still available. When we ran an RMAN backup, it sometimes worked and sometimes didn't, depending on whether it got the good interface.
I was reading an earlier post http://rootvg.net/forums/index.php/topic,392.0.html which looked similar to my problem in that it looked like the same error (switch failure) caused both our problems. One of the replies (by John) suggested it was caused by having two NICs on the same network, which is like what we have. I was hoping that somehow the persistent address was causing the problem and that changing the distribution preference to collocation with persistent label would do something. It didn't.
Hopefully, we will get our purchasing department sorted and have the bill paid so I can give the problem to IBM support. It's just so frustrating to see that it looks like it should work; it's just that it doesn't. I am thinking Etherchannel will be the way to go, but that is going to require downtime, which is always a problem when you want high availability.
Cheers,
Bernie.
Michael
Administrator
Hero Member
Posts: 676
Re: HACMP routes after a switch failure
«
Reply #3 on:
March 30, 2009, 01:47:53 PM »
No, not obvious. It looks like how I would have set it up, except that I prefer persistent addresses in a different subnet than the production (service) addresses.
The question running thru my head as I try to understand your question better: are there applications within the cluster also using the default route? It is sounding like an application bound itself to the interface, rather than the address - and the address it needed to reach was direct (i.e. not routed) so it held the interface as definition. It has been years since I have done any IP network (interface) programming, so please forgive me if my vague memory (or is it wild imagination) is inaccurate.
Are these stand-alone nodes, and/or are these physical interfaces? (It is about partition nodes that I forget to ask.)
The only thing that stands out are the Oerrs (output errors) for both interfaces.
re: meaning of ==>. That is going to be difficult to find an answer to. I have taken it to mean the default interface, but more recently I have heard that it may be the interface used by the interface network for broadcast addresses. Do not hold your breath for a definitive answer.
Battenoff
New Member
Posts: 4
Re: HACMP routes after a switch failure
«
Reply #2 on:
March 30, 2009, 11:26:09 AM »
Hopefully, the error is obvious. I changed the names and left the addresses as is since they are in the private address space.
One thing, slightly off topic, but what does the => in the routing table indicate? I could not find anything in the documentation on it and guessed it was meant to refer to another entry, but when the route was buggered, it was associated with the card connected to the bad switch.
root@primary:/ > /usr/es/sbin/cluster/utilities/cltopinfo
Cluster Name: ora_prd_cluster
Cluster Connection Authentication Mode: Standard
Cluster Message Authentication Mode: None
Cluster Message Encryption: None
Use Persistent Labels for Communication: No
There are 2 node(s) and 2 network(s) defined
NODE primary:
    Network net_diskhb_01
        primary_hdisk5_01  /dev/hdisk5
    Network net_ether_01
        production   192.168.160.46
        primary_if2  10.0.2.1
        primary_if1  10.0.1.1
NODE secondary:
    Network net_diskhb_01
        secondary_hdisk5_01  /dev/hdisk5
    Network net_ether_01
        production     192.168.160.46
        secondary_if2  10.0.2.2
        secondary_if1  10.0.1.2
Resource Group ora_prd_group
    Startup Policy       Online On Home Node Only
    Fallover Policy      Fallover To Next Priority Node In The List
    Fallback Policy      Never Fallback
    Participating Nodes  primary secondary
    Service IP Label     production
root@primary:/ > netstat -ni
Name  Mtu    Network      Address            Ipkts    Ierrs  Opkts    Oerrs  Coll
en0   1500   link#2       0.11.25.bf.5a.d8   977875   0      302223   3      0
en0   1500   10.0.1       10.0.1.1           977875   0      302223   3      0
en1   1500   link#3       0.11.25.bf.5a.77   8338784  0      6419315  4      0
en1   1500   10.0.2       10.0.2.1           8338784  0      6419315  4      0
en1   1500   192.168.160  192.168.160.47     8338784  0      6419315  4      0
en1   1500   192.168.160  192.168.160.46     8338784  0      6419315  4      0
lo0   16896  link#1                          296158   0      296507   0      0
lo0   16896  127          127.0.0.1          296158   0      296507   0      0
lo0   16896  ::1                             296158   0      296507   0      0
root@primary:/ > netstat -nr
Routing tables
Destination      Gateway          Flags  Refs  Use      If   Exp  Groups
Route Tree for Protocol Family 2 (Internet):
default          192.168.160.234  UG     26    5862416  en1  -    -
10.0.1.0         10.0.1.1         UHSb   0     0        en0  -    -       =>
10.0.1/24        10.0.1.1         U      2     301844   en0  -    -
10.0.1.1         127.0.0.1        UGHS   2     53424    lo0  -    -
10.0.1.255       10.0.1.1         UHSb   0     4        en0  -    -
10.0.2.0         10.0.2.1         UHSb   0     0        en1  -    -       =>
10.0.2/24        10.0.2.1         U      2     307716   en1  -    -
10.0.2.1         127.0.0.1        UGHS   0     50209    lo0  -    -
10.0.2.255       10.0.2.1         UHSb   0     4        en1  -    -
127/8            127.0.0.1        U      9     172890   lo0  -    -
192.168.160.0    192.168.160.47   UHSb   0     0        en1  -    -       =>
192.168.160/24   192.168.160.47   U      1     241283   en1  -    -
192.168.160.46   127.0.0.1        UGHS   0     2921     lo0  -    -
192.168.160.47   127.0.0.1        UGHS   6     17152    lo0  -    -
192.168.160.255  192.168.160.47   UHSb   0     0        en1  -    -
Route Tree for Protocol Family 24 (Internet v6):
::1              ::1              UH     0     0        lo0  -    -
root@primary:/ > exit
«
Last Edit: March 30, 2009, 01:33:59 PM by Michael
»
Michael
Administrator
Hero Member
Posts: 676
Re: HACMP routes after a switch failure
«
Reply #1 on:
March 30, 2009, 08:05:23 AM »
The start would be output from the following commands: cltopinfo, netstat -ni and netstat -nr.
You may modify the address enough to make them "pseudo", but not so much that the network configuration changes.
And now I get some time to think
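Since the routing table from the moment of failure turned out to be the missing piece in this thread, it may help to capture all three outputs in one go while the problem is present. The following is a minimal Python sketch under stated assumptions: that a Python 3.7+ interpreter is available on the node (not a given on AIX), and that cltopinfo lives at the path shown elsewhere in this thread. The function name `snapshot` is made up for illustration.

```python
import subprocess
from datetime import datetime

# The three commands asked for above; the cltopinfo path is the one
# used in the outputs posted in this thread.
COMMANDS = [
    ["/usr/es/sbin/cluster/utilities/cltopinfo"],
    ["netstat", "-ni"],
    ["netstat", "-nr"],
]


def snapshot(commands=COMMANDS):
    """Run each command and return one timestamped text report."""
    parts = ["# snapshot taken " + datetime.now().isoformat(timespec="seconds")]
    for cmd in commands:
        parts.append("## " + " ".join(cmd))
        try:
            done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            parts.append(done.stdout + done.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing binary or a hung command should not lose the rest.
            parts.append("could not run: %s" % exc)
    return "\n".join(parts)


if __name__ == "__main__":
    print(snapshot())
```

Redirecting the output to a file once before pulling the cable and once after gives two reports that can be compared side by side.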
Battenoff
New Member
Posts: 4
HACMP routes after a switch failure
«
on:
March 30, 2009, 01:11:45 AM »
G'day all,
I have read through the posts here and have found some descriptions close to my own problem but they are not quite the same.
Firstly, I want to state at the outset that I know using Etherchannel would overcome the issue - but hacmp is supposed to do the work and high availability means high availability - we have it because getting down time agreed to is difficult.
Now I have got that off my chest, I will describe the problem. I have a 2 node hacmp cluster using IPAT. It's running 5.3 cluster software on an AIX 5.3 machine. Our persistent address is on the same subnet as our service address (192.168.160.0/24) and the boot addresses on the two cards are on different subnets to each other and to the service address (10.0.1.0/24, 10.0.2.0/24). When the switch failed last year, the service address and the gateway moved to the working switch. However, looking at the routing table, we still had a route using the faulty interface (en1). Consequently, every second ip connection on the subnet of the service address (192.168.160) seemed to be sent to the bad route. I did try the boot address subnet and it worked fine but can't guarantee how well I checked it - limited down time meant not much time to test different scenarios. Everything that went through the default gateway worked, so clients didn't notice any difference.
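The addressing described above can be checked with a few lines of Python. This is a minimal sketch using the standard ipaddress module: the interface-to-address mapping in `POSTED` is taken from the netstat -ni output posted later in this thread, the /24 prefix is the one stated in this post, and the `AFTER_MOVE` mapping is purely hypothetical (it is not a captured state). The function names are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# Addresses per interface as posted in the thread, with the /24 stated above.
POSTED = {
    "en0": ["10.0.1.1/24"],
    "en1": ["10.0.2.1/24", "192.168.160.47/24", "192.168.160.46/24"],
}

# HYPOTHETICAL layout for illustration only: one 192.168.160 address on
# each interface. This was not captured from the cluster.
AFTER_MOVE = {
    "en0": ["10.0.1.1/24", "192.168.160.46/24"],
    "en1": ["10.0.2.1/24", "192.168.160.47/24"],
}


def interfaces_by_subnet(addresses):
    """Map each IP network to the set of interfaces holding an address in it."""
    result = defaultdict(set)
    for ifname, addrs in addresses.items():
        for addr in addrs:
            result[ipaddress.ip_interface(addr).network].add(ifname)
    return result


def shared_subnets(addresses):
    """Return the subnets that are configured on more than one interface."""
    found = interfaces_by_subnet(addresses)
    return sorted(str(net) for net, ifs in found.items() if len(ifs) > 1)


print(shared_subnets(POSTED))      # []
print(shared_subnets(AFTER_MOVE))  # ['192.168.160.0/24']
```

In the posted state no subnet spans two interfaces. The hypothetical layout shows what the check would report if the two 192.168.160 addresses ended up on different cards; whether that is what actually happened during the failure is not established in this thread.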
I did some research and found seemingly related posts on the internet that kind of suggested the sort of behaviour we were seeing could be fixed by using dead gateway detection - I tried both active_dgd on the route and passive_dgd, neither made a difference.
Reading through the posts here, it seems the problem is our choice of addresses and networks. At the same time as all this, our accounting has somehow managed to not pay our HACMP maintenance so I can't run this by IBM.
Does anyone have a suggestion of what might be the cause and a potential fix to our problems?
Any insight or comments would be appreciated!
Cheers,
Bernie.
Copyright 2001-2010 Michael Felt, John R Peck and ROOTVG.NET