jtree corruption

Added by Dan Rautio, last edited by Dan Rautio on May 31, 2013

jtree corruption papers and ppt - rchip wedge, RKME, etc
jtree corruption PRs
Stoli specific jtree corruption
Information to gather when there is jtree corruption
    list of commands to run periodically to look for the first signs of jtree corruption
    if you see jtree corruption, capture a live FPC core
    if you have jtree corruption, you might get some "Detected error nexthop" syslogs.
    Use the above information to plug into jsim (both on-chip and shadow copy) and run on the FPC before rebooting it
    In addition, it is very important to know the topology chain the prefix is using. follow the topology chain in the PFE:

jtree corruption papers and ppt - rchip wedge, RKME, etc


jtree corruption
trouble with gibson
rchip wedge, RKME errors
RKME errors FRF.16

jtree corruption PRs


PR-524215
    description: seeing "jtree memory free using incorrect value 14 correct 0" error messages after offlined a PIC
    fixed: 10.2R2
    notes: indirect-next-hop related, fixed only in LMNR - cosmetic

PR-562719
    description: "jtree memory free using incorrect value 8 correct 0" messages seen for all DPCs when MPC was installed and OSPF and ISIS was activated simultaneously
    fixed: open
    notes: PR-524251 fix ported to I and ABC - cosmetic

PR-534336
    description: Bellini :: "jtree memory free using incorrect value 2 correct 0" observed while testing VPLS over AE
    fixed: dup of PR-562719
    notes: early warning from systest on i-chip platform

PR-545142
    description: pfe jtree corruption with indirect-next-hop knob used and link flaps
    fixed: 10.2R3.1
    notes: bug introduced in 10.2+

PR-545624
    description: seeing "jtree memory free using incorrect value 14 correct 0" error messages
    fixed: dup of PR-525142
    notes: duplicate

PR-588212
    description: JUNOS releases at 10.2 or above with ES FPCs report SRCHIP/SLCHIP errors when a loopback firewall filter is applied
    notes: cosmetic only

PR-577710
    description: PDT-CBU: getting error messages fpc4 jtree memory free using incorrect value 2 correct 0 after flapping interfaces

PR-587369
    description: nh_index_tree_lookup error on ES-FPC // nh_index_tree_lookup : %PFE-3: nh_index_tree_lookup failed for nh_index 1764336, nh_destmask 1

PR-587369
    description: Jtree corruption and SRchip wedge after ae lacp bundle activation on remote side
    notes: It's not always an error if we don't find the entry corresponding to provided arguments. So one should let the caller handle the failure. This error message wasn't there in pre-HAL code, and that's the right way.
    notes: HAL related 10.2+

PR-661950
    description: jtree corruption RKME/DESRD errors on Stoli
    notes: This PR details another way of hitting PR 548436. configurations where prefixes learned through IGP are pointing directly at CBF LSPs, a jtree corruption can occur (resulting in packet loss) upon topology events (member link flap in aggregate or remote end of member links flapping or FPC reboots on local/remote end or route flaps) Workaround of PR 548436 only addresses routes learned with BGP.

PR-548436
    description: Jtree corruption and SRchip wedge after ae lacp bundle activation on remote side
    notes: Indexed nexthop (CBF) related

PR-665872
    description: Need to generate appropriate alarm upon RKME errors
    notes: Improve jtree corruption reporting and alarming

PR-558645
    description: NH HAL: Topo-walk frees context before doing a back propagate
    notes: This is a day-1 bug in HAL code. HALP context is freed before doing a topo change event back propagation. This affects Indirect, Unilist, Composites which depend on the HALP state.
    notes: HAL related 10.2+

PR-665191
    description: "show jtree [0] debug ip check on-chip [table]" might corrupt shadow copy

PR-665194
    description: JTREE corruption on multiple fpc and RCHIP wedge
    notes: This is a duplicate of PR-558645
    notes: HAL related 10.2+

PR-707453
    description: jtree memory corruption when the interface flaps
    notes: This is day one problem with HAL code. Got surfaced with ECMP-64 knob. rchip specific fix
    notes: HAL related 10.2+. rchip version of PR-743323

PR-743323
    description: jtree memory corruption when the interface flaps at ICHIP FPC
    notes: This is day one problem with HAL code. Got surfaced with ECMP-64 knob. I-chip specific fix
    notes: HAL related 10.2+. I-chip version of PR-707453

PR-676021
    description: Multiple free errors were seen during route change event for VPLS routes.
    notes: this bug was introduced as a result of the fix for 597941

PR-756464
    description: related to 'route-memory-enhanced', 'indirect-next-hop', and per-prefix load balancing
    notes: RCCA
    notes: HAL related 10.2+

PR-695336
    description: related to 'route-memory-enhanced', 'indirect-next-hop', and per-prefix load balancing
    notes: route-memory-enhanced and per-prefix-lb required
    notes: ichip specific

PR-818021
    description: Invalid PDP errors and FPC crashes can occur when CBF enabled
    notes: caused by PR-554725
    notes: PR-867543 is a duplicate PR

PR-771963
    description: l3vpn composite and ECMP related

PR-800066
    description: JTREE corruption aggregate interface link flap, later all DPCs crashed in mtj_nh_free_jtree_context
    notes: kernel-pfe sync faq

PR-858886
    description: Jtree corruption with log fpc4 RT: %PFE-3: Failed prefix change IPv4 & jt_change failed for ifl 0 on FE 0, rt_jtree_change()
    fixed: Dup PR/832414
    notes: likely fixed by PR/730686, PR/710702, PR/738922, PR/695366

Stoli specific jtree corruption


http://confluence.jnpr.net/confluence/display/IPGE/Stinger+warts

Information to gather when there is jtree corruption
list of commands to run periodically to look for the first signs of jtree corruption
--------------------------------
# no more RKME/DESRD errors:
show log messages | match "RKME|DESRD"
show route summary
show route forwarding-table summary
show bgp summary

# from the shell
cprod -A fpc<x> -c "show jtree 0 table tid 0" > tabledump.0

start shell pfe network fpc<x>

# if there are 2 PFEs, show both jtree 0 and 1
# no ongoing truncated key/btt/stack *flow:
show pfe statistics traffic
# gather some info
show jtree 0 summary
show jtree 1 summary
show jtree 0 onchip summary
show jtree 1 onchip summary
show jtree 0 memory extensive
show jtree 1 memory extensive
# no errors from below commands:
show jtree 0 debug ip check table
show jtree 1 debug ip check table
show jtree 0 debug ip on-chip check table
show jtree 1 debug ip on-chip check table
show jtree 0 debug ip check jumptable
show jtree 1 debug ip check jumptable
show jtree 0 debug ip on-chip check jumptable
show jtree 1 debug ip on-chip check jumptable
# err_crc_count* = 0 on below commands:
show nchip 1 dbuf
show nchip 3 dbuf
# no dbtt_cntr errors:
show rchip 0 counters
show rchip 1 counters
# no errors:
show rchip 0 statistics errors
show rchip 1 statistics errors
# no ECC:
show rchip 0 statistics dma
show rchip 1 statistics dma
# track IPC messages to PFE
show nhdb management all
-----------------------------------
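
Because most of the checks above repeat once per jtree/rchip instance, the list can be generated instead of typed by hand. The following is only a sketch: the helper name `periodic_pfe_checks` is made up for illustration, it assumes instances are numbered from 0, and the two `show nchip` lines are copied with the fixed instance numbers used above because this page does not say how they map to PFEs.

```python
def periodic_pfe_checks(num_pfes=2):
    """Return the FPC vty commands from the checklist above.

    Assumption: jtree and rchip instances are numbered 0..num_pfes-1.
    """
    per_instance = [
        "show jtree {n} summary",
        "show jtree {n} onchip summary",
        "show jtree {n} memory extensive",
        "show jtree {n} debug ip check table",
        "show jtree {n} debug ip on-chip check table",
        "show jtree {n} debug ip check jumptable",
        "show jtree {n} debug ip on-chip check jumptable",
        "show rchip {n} counters",
        "show rchip {n} statistics errors",
        "show rchip {n} statistics dma",
    ]
    cmds = ["show pfe statistics traffic"]
    for n in range(num_pfes):
        cmds.extend(template.format(n=n) for template in per_instance)
    # nchip instance numbers are taken verbatim from the checklist (assumption).
    cmds.extend(["show nchip 1 dbuf", "show nchip 3 dbuf"])
    cmds.append("show nhdb management all")
    return cmds


if __name__ == "__main__":
    print("\n".join(periodic_pfe_checks(2)))
```

The output can be pasted into the FPC vty or saved next to the collected logs so that successive runs are comparable.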

if you see jtree corruption, capture a live FPC core


LCC0-EGFPC5(flame-lcc0-re0 vty)# write core
[Jun 24 21:54:27.393 LOG: Info] Dumping core-LCC0-EGFPC5 to 16777217

LCC0-EGFPC5(flame-lcc0-re0 vty)# [Jun 24 21:59:24.159 LOG: Info] Coredump finished!

LCC0-EGFPC5(flame-lcc0-re0 vty)#

if you have jtree corruption, you might get some "Detected error nexthop" syslogs.
Use Josef's document to decode the packet, or see my "Detected error nexthop" cheat sheet.
Here are some examples:
jtac-bbuild01: /volume/CSdata/kmuthuswamy/cases/vz/vz-t/2011-0621-0876/log/sfc-re0>zmore messages.4.gz | grep lcc0-fpc5
Jun 21 19:42:51 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: Detected error nexthop:
Jun 21 19:42:51 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: notif[ 0]: 38080034 00128050 7dfffdff 0cd2d279
Jun 21 19:42:51 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: notif[ 4]: 00000000 00000000 00000000 00000000
Jun 21 19:42:51 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: notif[ 8]: 00000000
Jun 21 19:42:52 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: packet[ 0]: 45 00 00 28 40 aa 00 00 1d 06 31 ad 47 f0 df 2c
Jun 21 19:42:52 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: packet[16]: ad d2 56 8a 08 4c 0d 3d 23 1e a0 1a c5 bf b2 51
Jun 21 19:42:52 RES-BB-RTR1 lcc0-master lcc0-fpc5 PFE: packet[32]: 50 10 78 00 bb 87 00 00

notif[1] = 0x00128050
0000 0000 0001 0010 1000 0000 0101 0000
<-------iif---------->
iif = 74 = 0x4a
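
The iif value can be checked mechanically. The sketch below assumes that the iif field is the upper 18 bits of the 32-bit notif[1] word, which is what the bracket under the binary dump covers and what both worked examples on this page (0x00128050 and 0x00200050) are consistent with. The field width is inferred from these notes, not taken from a hardware specification, and the function names are illustrative only.

```python
IIF_SHIFT = 14  # assumption: iif occupies the top 18 bits of notif[1]


def decode_iif(notif1_hex):
    """Return the incoming interface index (iif) from notif[1] given as hex text."""
    return int(notif1_hex, 16) >> IIF_SHIFT


def grouped_bits(word_hex):
    """Render a 32-bit word as eight space-separated nibbles, as in the notes."""
    bits = format(int(word_hex, 16), "032b")
    return " ".join(bits[i:i + 4] for i in range(0, 32, 4))


if __name__ == "__main__":
    for word in ("00128050", "00200050"):
        print(word, grouped_bits(word), "iif =", decode_iif(word))
```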

notif [3] = 0x0cd2d279


0000 1100 1101 0010 1101 0010 0111 1001
<---------------sadr--------------->

packet:
45 - version 4, header length 5 words (20 bytes)
00 - tos
00 28 - total length <- 40
40 aa - id
00 00 - flags <- none set, fragment offset 0
1d - ttl <- 29
06 - protocol <- tcp
31 ad - checksum
47 f0 df 2c - source address <- 71.240.223.44
ad d2 56 8a - destination address <- 173.210.86.138
08 4c - tcp source port <- 2124
0d 3d - tcp destination port <- 3389
23 1e a0 1a - tcp seq number
c5 bf b2 51 - tcp ack number
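
The same byte-by-byte decode can be done with Python's standard `struct` module. This is a minimal sketch that assumes the dump starts at the IPv4 header (no layer-2 bytes in front) and that only the first TCP fields are of interest; `decode_ipv4_tcp` is an illustrative name, not an existing tool.

```python
import socket
import struct


def decode_ipv4_tcp(hex_bytes):
    """Decode the IPv4 header and the first TCP fields from hex byte text."""
    raw = bytes.fromhex(hex_bytes)
    (ver_ihl, _tos, total_len, _ident, flags_frag, ttl, proto, _checksum,
     src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    ihl = (ver_ihl & 0x0F) * 4  # IPv4 header length in bytes
    fields = {
        "version": ver_ihl >> 4,
        "total_length": total_len,
        "dont_fragment": bool(flags_frag & 0x4000),
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
    if proto == 6 and len(raw) >= ihl + 12:  # TCP: ports, seq, ack
        sport, dport, seq, ack = struct.unpack("!HHII", raw[ihl:ihl + 12])
        fields.update(sport=sport, dport=dport, seq=seq, ack=ack)
    return fields


# Bytes copied from the packet[0], packet[16] and packet[32] syslog lines above.
pkt = decode_ipv4_tcp(
    "45 00 00 28 40 aa 00 00 1d 06 31 ad 47 f0 df 2c "
    "ad d2 56 8a 08 4c 0d 3d 23 1e a0 1a c5 bf b2 51 "
    "50 10 78 00 bb 87 00 00"
)
print(pkt["src"], pkt["sport"], "->", pkt["dst"], pkt["dport"], "ttl", pkt["ttl"])
```

The destination address is the value to feed into `set jsim ipdst` later on this page.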

jtac-bbuild01: /volume/CSdata/kmuthuswamy/cases/vz/vz-t/2011-0621-0876/log/sfc-re0>zmore messages.4.gz | grep lcc1-fpc1
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: Detected error nexthop:
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 0]: 39080034 00200050 fdfffdff 0d62fe95
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 4]: 00000000 00000000 00000000 00000000
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 8]: 00000000
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[ 0]: 45 00 00 28 19 cf 40 00 3d 06 ba 86 4a 6b 52 c8
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[16]: cc d1 ff 75 07 87 00 50 b2 31 2c 98 1d 17 9e b6
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[32]: 50 10 80 00 23 eb 00 00

notif[1] = 0x00200050
0000 0000 0010 0000 0000 0000 0101 0000
<-------iif---------->
iif = 128 = 0x80

notif [3] = 0x0d62fe95


0000 1101 0110 0010 1111 1110 1001 0101
<---------------sadr--------------->

packet:
45 - version 4, header length 5 words (20 bytes)
00 - tos
00 28 - total length
19 cf - id
40 00 - flags <- don't fragment
3d - ttl <- 61
06 - protocol <- tcp
ba 86 - checksum
4a 6b 52 c8 - source address <- 74.107.82.200
cc d1 ff 75 - destination address <- 204.209.255.117
07 87 - tcp source port <- 1927
00 50 - tcp destination port <- 80
b2 31 2c 98 - tcp seq number
1d 17 9e b6 - tcp ack number
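
When many of these syslogs have to be decoded, the notif words and packet bytes can be pulled out of the log text first. The sketch below rests on two assumptions: each syslog record sits on a single line (not wrapped), and the input has already been filtered to one FPC and one event, as the `grep` above does. The function name `collect_error_nexthop` is hypothetical.

```python
import re

LINE_RE = re.compile(r"PFE: (notif|packet)\[\s*(\d+)\]:\s*(.+)$")


def collect_error_nexthop(lines):
    """Gather notif words and packet bytes from 'Detected error nexthop' syslogs."""
    notif, packet = [], []
    for line in lines:
        m = LINE_RE.search(line.rstrip())
        if not m:
            continue  # header line or unrelated syslog
        kind, _offset, payload = m.groups()
        values = [int(token, 16) for token in payload.split()]
        if kind == "notif":
            notif.extend(values)
        else:
            packet.extend(values)
    return notif, bytes(packet)


sample = """\
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: Detected error nexthop:
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 0]: 39080034 00200050 fdfffdff 0d62fe95
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 4]: 00000000 00000000 00000000 00000000
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: notif[ 8]: 00000000
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[ 0]: 45 00 00 28 19 cf 40 00 3d 06 ba 86 4a 6b 52 c8
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[16]: cc d1 ff 75 07 87 00 50 b2 31 2c 98 1d 17 9e b6
Jun 21 19:42:51 RES-BB-RTR1 lcc1-master lcc1-fpc1 PFE: packet[32]: 50 10 80 00 23 eb 00 00
"""

notif_words, packet_bytes = collect_error_nexthop(sample.splitlines())
print(hex(notif_words[1]), packet_bytes[16:20].hex())
```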

Use the above information to plug into jsim (both on-chip and shadow copy) and run on the FPC before rebooting it
In a different example, the "Detected error nexthop" messages were decoded to the following values:
$route           $iFPC/$PFE  $intf     $iif
38.251.140.24    FPC0/0      xe-0/0/0  93
194.6.182.149    FPC1/0      xe-1/0/0  112
69.212.198.149   FPC1/1      xe-1/1/0  116
169.181.194.109  FPC2/1      xe-2/1/0  130

Now run jsim on each affected FPC with these values:
# for each row in the table above, run the commands below:

start shell pfe network $iFPC


show jtree 0 debug ip check table
show jtree 1 debug ip check table
show jtree 0 debug ip on-chip check table
show jtree 1 debug ip on-chip check table
show jtree 0 debug ip check jumptable
show jtree 1 debug ip check jumptable
show jtree 0 debug ip on-chip check jumptable
show jtree 1 debug ip on-chip check jumptable
jsim reset full $PFE
set jsim ipsrc 1.1.1.1
set jsim ipdst $route
set jsim iif $iif
set jsim ip-protocol 6
jsim lookup on-chip verbose
jsim lookup verbose
quit
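
Filling in the variables by hand for every row is error-prone, so the jsim part of the sequence can be rendered from the decoded values. This is a sketch under stated assumptions: `jsim_commands` is an illustrative helper, the vty target is assumed to be the FPC name in lower case, and only the jsim lines are rendered (the `debug ip check` commands above take no per-row arguments).

```python
def jsim_commands(route, fpc_pfe, iif, ip_protocol=6):
    """Render the jsim lookup sequence above for one decoded row.

    fpc_pfe is the "$iFPC/$PFE" column, for example "FPC1/0".
    Assumption: "start shell pfe network" takes the FPC name in lower case.
    """
    fpc, pfe = fpc_pfe.split("/")
    return [
        f"start shell pfe network {fpc.lower()}",
        f"jsim reset full {pfe}",
        "set jsim ipsrc 1.1.1.1",
        f"set jsim ipdst {route}",
        f"set jsim iif {iif}",
        f"set jsim ip-protocol {ip_protocol}",
        "jsim lookup on-chip verbose",
        "jsim lookup verbose",
        "quit",
    ]


# Rows copied from the decoded table above ($intf is not needed by jsim).
rows = [
    ("38.251.140.24", "FPC0/0", 93),
    ("194.6.182.149", "FPC1/0", 112),
    ("69.212.198.149", "FPC1/1", 116),
    ("169.181.194.109", "FPC2/1", 130),
]

if __name__ == "__main__":
    for route, fpc_pfe, iif in rows:
        print("\n".join(jsim_commands(route, fpc_pfe, iif)))
        print()
```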

In addition, it is very important to know the topology chain the prefix is using. Follow the topology chain in the PFE:
# for each row in the table above, run the commands below:

show route $route detail


# above will give you the mask to use
show route forwarding-table destination $route/mask detail

start shell pfe network $iFPC


show topology route ip prefix $route/mask
show nhdb id <nhdb id>
# repeat "show nhdb id" for every next hop ID found while following the topology chain
show topology nhdb id <nhdb id>

here is an example
Route: 173.210.86.138

too long# show route ip prefix 173.210.86.138


show route ip prefix 173.210.86.138

IPv4 Route Table 0, default.0, 0x0:


Destination NH IP Addr Type NH ID Interface
------------ --------------- -------- ----- ---------
173.210.86/24 Unilist 2097957 RT-ifl 0
ae1.0 ifl 68

too long# show nhdb id 2097957
ID      Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags      PFE internal Flags
------- -------- ------------- --------------- ---------- ------------ ---- ---------- --------------------
2097957 Unilist  -             -               IPv4       -            0    0x00000000 0x00000000

too long# show nhdb id 2097957 ext
ID      Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags      PFE internal Flags
------- -------- ------------- --------------- ---------- ------------ ---- ---------- --------------------
2097957 Unilist  -             -               IPv4       -            0    0x00000000 0x00000000
Nexthop Status:
Index: 0 (0x0), Slot: Unspecified
Destination (0x100): RE Chip8
Topo link: 0x8ff24780 0x89b9c780
Refcount: 181720, Interface: 68, Nexthop Flags: 0x0
Routing-table id: 0

[Allocated entries (jtree memory): 0]

Unilist Table (2 entries): List flags 0x0

794     Aggreg.  ae1.0         -               IPv4       Ethernet     0    0x00000000 0x00000840
800     Aggreg.  ae2.0         -               IPv4       Ethernet     0    0x00000000 0x00000840

Weight Info: Current Weight = 0

ID    Balance Orig-Balance Weight Orig-Weight State    Install     Flags
----- ------- ------------ ------ ----------- -------- ----------- -----
794   0       0            0      0           Active   Installed   0x00
800   0       0            0      0           Active   Installed   0x00

too long#

too long# show nhdb id 794
ID      Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags      PFE internal Flags
------- -------- ------------- --------------- ---------- ------------ ---- ---------- --------------------
794     Aggreg.  ae1.0         -               IPv4       Ethernet     0    0x00000000 0x00000840

too long# show nhdb id 794 ext
ID      Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags      PFE internal Flags
------- -------- ------------- --------------- ---------- ------------ ---- ---------- --------------------
794     Aggreg.  ae1.0         -               IPv4       Ethernet     0    0x00000000 0x00000840

Nexthop Status:
Index: 0 (0x0), Slot: Unspecified
Destination (0x100): RE Chip8
Topo link: 0x35eecec3 0x8be986c3
Refcount: 53714, Interface: 68, Nexthop Flags: 0x0
Routing-table id: 0

[Allocated entries (jtree memory): 0]

Aggreg.-Container Table (5 entries): List flags 0x0

795     Unicast  xe-1/0/1.0    152.63.32.141   IPv4       Ethernet     0    0x00000000 0x00000000
796     Unicast  xe-10/1/2.0   152.63.32.141   IPv4       Ethernet     0    0x00000000 0x00000000
797     Unicast  xe-10/1/3.0   152.63.32.141   IPv4       Ethernet     0    0x00000000 0x00000000
798     Unicast  xe-12/0/2.0   152.63.32.141   IPv4       Ethernet     0    0x00000000 0x00000000
799     Unicast  xe-13/0/1.0   152.63.32.141   IPv4       Ethernet     0    0x00000000 0x00000000

Weight Info: Current Weight = 1

ID    Balance Orig-Balance Weight Orig-Weight State    Install     Flags
----- ------- ------------ ------ ----------- -------- ----------- -----
795   0       0            1      1           Active   Installed   0x00
796   0       0            1      1           Active   Installed   0x00
797   0       0            1      1           Active   Installed   0x00
798   0       0            1      1           Active   Installed   0x00
799   0       0            1      1           Active   Installed   0x00

need to also get the nh info for all the unicast nhs:
too long# show nhdb id 795 ext
too long# show nhdb id 796 ext
too long# show nhdb id 797 ext
too long# show nhdb id 798 ext
too long# show nhdb id 799 ext

too long# show nhdb id 800 ext
ID      Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags      PFE internal Flags
------- -------- ------------- --------------- ---------- ------------ ---- ---------- --------------------
800     Aggreg.  ae2.0         -               IPv4       Ethernet     0    0x00000000 0x00000840

Nexthop Status:
Index: 0 (0x0), Slot: Unspecified
Destination (0x100): RE Chip8
Topo link: 0x35ee7ec3 0x8be90ec3
Refcount: 36340, Interface: 69, Nexthop Flags: 0x0
Routing-table id: 0

[Allocated entries (jtree memory): 0]

Aggreg.-Container Table (5 entries): List flags 0x0

801     Unicast  xe-1/0/2.0    152.63.34.21    IPv4       Ethernet     0    0x00000000 0x00000000
802     Unicast  xe-10/1/0.0   152.63.34.21    IPv4       Ethernet     0    0x00000000 0x00000000
803     Unicast  xe-10/1/1.0   152.63.34.21    IPv4       Ethernet     0    0x00000000 0x00000000
804     Unicast  xe-12/0/0.0   152.63.34.21    IPv4       Ethernet     0    0x00000000 0x00000000
805     Unicast  xe-12/0/1.0   152.63.34.21    IPv4       Ethernet     0    0x00000000 0x00000000

Weight Info: Current Weight = 1

ID    Balance Orig-Balance Weight Orig-Weight State    Install     Flags
----- ------- ------------ ------ ----------- -------- ----------- -----
801   0       0            1      1           Active   Installed   0x00
802   0       0            1      1           Active   Installed   0x00
803   0       0            1      1           Active   Installed   0x00
804   0       0            1      1           Active   Installed   0x00
805   0       0            1      1           Active   Installed   0x00

need to also get the nh info for all the unicast nhs:
too long# show nhdb id 801 ext
too long# show nhdb id 802 ext
too long# show nhdb id 803 ext
too long# show nhdb id 804 ext
too long# show nhdb id 805 ext

too long# show topology nhdb id 2097957


show topology nhdb id 2097957
Topology: nh(Unilist,2097957)
Flavor: nexthop (8), Refcount 181720, Flags 0x1
Addr: 0xb142488, Next: 0x0, Context 0xb142470
Link 0: 8ff24780, Offset -1, Next: 00000000
Link 1: 89b9c780, Offset -1, Next: 00000000

Topology Neighbors:
41.218.192/18-> nh(Unilist,2097957)
213.185.80/20-+
46.182.52/22-+
143.86.8/24-+
209.136.29/24-+
209.136.30/24-+
144.104.96/20-+
[..]
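
Following the chain by hand means reading the member IDs out of each "show nhdb id <id> ext" output and querying them in turn. The sketch below extracts those member IDs from captured output. It assumes that member rows start with a numeric ID followed by a type name beginning with a letter (as in the Unilist and Aggreg.-Container tables above), which is also what separates them from the weight rows; `member_nexthops` is an illustrative name.

```python
import re

# A member row looks like "<id> <Type> <interface> ..."; weight rows have a
# number in the second column and are therefore skipped.
ROW_RE = re.compile(r"^\s*(\d+)\s+([A-Za-z][\w.]*)\s+(\S+)")


def member_nexthops(nhdb_ext_output, parent_id):
    """Return (id, type, interface) for each member listed under a parent nexthop."""
    members = []
    for line in nhdb_ext_output.splitlines():
        m = ROW_RE.match(line)
        if m and int(m.group(1)) != parent_id:
            members.append((int(m.group(1)), m.group(2), m.group(3)))
    return members


captured = """\
2097957 Unilist - - IPv4 -
0 0x00000000 0x00000000
Unilist Table (2 entries): List flags 0x0
794 Aggreg. ae1.0 - IPv4 Ethernet
0 0x00000000 0x00000840
800 Aggreg. ae2.0 - IPv4 Ethernet
0 0x00000000 0x00000840
Weight Info: Current Weight = 0
794 0 0 0 0 Active Installed 0x00
800 0 0 0 0 Active Installed 0x00
"""

for nh_id, nh_type, intf in member_nexthops(captured, 2097957):
    print(f"show nhdb id {nh_id} ext   # {nh_type} {intf}")
```

Each printed command is the next step down the chain; the walk ends when only Unicast entries remain.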

Analyzing Jtree issues

Added by Siddharth Sudhakar, last edited by Siddharth Sudhakar on Jul 22, 2009

Diagnose memory fragmentation on PFE JTREE memory
ADPC8(obelix-re0 vty)# show jtree 1 memory extensive
Jtree memory segment 0 (Context: 0x3641d70)
-------------------------------------------
Memory Statistics:
16777216 bytes total
16747000 bytes used
17992 bytes available (0 bytes from free pages) <<<<<<<<
5040 bytes wasted
7184 bytes unusable
32768 pages total
32614 pages used (2478 pages used in page alloc)
154 pages partially used
0 pages free (max contiguous = 0) <<<<<<<<
Partially Filled Pages (In bytes):-
Unit Avail Overhead
24 408 16
32 480 0
40 400 7168
64 16704 0

Free Page Lists(Pg Size = 512 bytes):-


Page Bucket Avail(Bytes)

Fragmentation Index = 0.996, (largest free = 64) <<<<<<<<


Counters:
1370763 allocs (271777 failed)
52552 releases(partial 0)
899598 frees
0 holds
0 pending frees(pending bytes 0)
0 pending forced
0 times free blocked
0 sync writes
Error Counters:-
0 bad params
0 failed frees
0 bad cookie

Check what is filling up the memory


ADPC8(obelix-re0 vty)# show jtree 1 memory extensive composition
App Bytes Allocs Frees
------------ -------- -------- --------
FW 91568 3816 0
FW Policer 0 0 0
Jtree 631272 800032 747108
NH 11098416 13529366 12717502
NH Policer 308544 6430 2
NH Cntr 0 0 0
NH Multi 0 0 0
NH Mcast 123872 4097362 4082574
PDP 72 3 0
Itable 9348880 146288 0
Test 0 0 0

ADPC8(obelix-re0 vty)# show rsmon


category instance type total lwm_limit hwm_limit free
-------- ----------- ------------ -------- --------- --------- --------
jtree jtree0-seg0 free-pages 32768 1638 4915 0 <<<<
jtree jtree0-seg0 free-dwords 2097152 104857 314572 0
jtree jtree0-seg1 free-pages 32768 1638 4915 23227
jtree jtree0-seg1 free-dwords 2097152 104857 314572 1486528
jtree jtree1-seg0 free-pages 32768 1638 4915 0 <<<<
jtree jtree1-seg0 free-dwords 2097152 104857 314572 0
jtree jtree1-seg1 free-pages 32768 1638 4915 23105
jtree jtree1-seg1 free-dwords 2097152 104857 314572 1478720
jtree jtree2-seg0 free-pages 32768 1638 4915 0
jtree jtree2-seg0 free-dwords 2097152 104857 314572 0
jtree jtree2-seg1 free-pages 32768 1638 4915 23176
jtree jtree2-seg1 free-dwords 2097152 104857 314572 1483264
jtree jtree3-seg0 free-pages 32768 1638 4915 0
jtree jtree3-seg0 free-dwords 2097152 104857 314572 0
jtree jtree3-seg1 free-pages 32768 1638 4915 23387
jtree jtree3-seg1 free-dwords 2097152 104857 314572 1496768
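
The rows marked <<<< are those where the free count has dropped to zero. A check like this can be automated, assuming (from the column names only, not from documentation) that lwm_limit is a low-watermark threshold and that a free count below it signals exhaustion. `below_low_watermark` is a hypothetical helper.

```python
import re

RSMON_RE = re.compile(
    r"^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def below_low_watermark(rsmon_text):
    """Return (instance, type, free) for rows whose free count is under lwm_limit."""
    hits = []
    for line in rsmon_text.splitlines():
        m = RSMON_RE.match(line)
        if not m:
            continue  # header, separator, or prompt line
        _category, instance, rtype = m.group(1, 2, 3)
        _total, lwm, _hwm, free = (int(v) for v in m.group(4, 5, 6, 7))
        if free < lwm:
            hits.append((instance, rtype, free))
    return hits


sample_rsmon = """\
category instance type total lwm_limit hwm_limit free
jtree jtree0-seg0 free-pages 32768 1638 4915 0 <<<<
jtree jtree0-seg0 free-dwords 2097152 104857 314572 0
jtree jtree0-seg1 free-pages 32768 1638 4915 23227
jtree jtree0-seg1 free-dwords 2097152 104857 314572 1486528
"""

print(below_low_watermark(sample_rsmon))
```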
ADPC8(obelix-re0 vty)# show nhdb management operations

Next Hop Statistics:


Type Adds Changes Deletes Failures
------------ ---------- ---------- ---------- ----------
Discard 688731 310351 682639 0
Reject 4015 0 0 0
Unicast 2822828 1581906 2798299 0
Unilist 256442 0 254524 0
Indexed 0 0 0 0
Indirect 374966 255146 369264 0
Hold 19 17 17 0
Resolve 2415 0 2 0
Local 2423 0 2 0
Receive 6021 0 2 0
MultiRT 0 0 0 0
Bcast 2816 0 2 0
Mcast 4011 0 0 0
Mgroup 2 0 0 0
MDiscard 4011 0 0 0
RtTable 5918 0 0 0
Unknown 0 0 0 0
Aggreg. 1495701 892435 1485998 0
Crypto 0 0 0 0
Unknown 0 0 0 0
Sample 0 0 0 0
Flood 6499005 2151048 5326145 390538 <<<<
Service 0 0 0 0
Unknown 0 0 0 0
Compst 0 0 0 379825 <<<<
DmxRslv 0 0 0 0
DmxIFL 0 0 0 0
DmxtDef 0 0 0 0
