Re: bit errors on spitz

From: Pavel Machek
Date: Friday, March 5, 2010 - 2:27 pm

Hi!

I'm getting way too many bit errors on spitz, with various
kernels.

It may be tied to network usage (bluetooth or wifi?). It happens even
on AC power.
									Pavel


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Eric Miao
Date: Sunday, March 7, 2010 - 10:37 pm

Pavel,

What kind of bit errors? I'm not using any network here on my spitz so
not sure what exactly was happening. Could you paste the dmesg here
so we can help take a look?

- eric
--

From: Pavel Machek
Date: Monday, March 8, 2010 - 12:28 am

dmesg would not be useful, it usually hits user programs. Like... mutt
suddenly displaying ',' instead of '-' in the header. A program failing
to start because function printg is not found (it was not exactly
printf->printg, I don't remember the exact symbol), ping complaining
about discarding corrupted packets, etc.

(Or of course, the kernel oopsing or not coming back from suspend at
all. But as even user data are being corrupted, the oops is not likely
to be interesting and the system is typically not in a state to capture
it any more.)
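
The two examples above are consistent with single-bit flips: ',' and '-'
differ in exactly one bit, and so do 'f' and 'g'. A minimal sketch that
shows this, assuming a hypothetical helper named differing_bits; the
printf/printg pair is only the approximate example from this message,
not the exact symbol that failed:

```python
def differing_bits(a: bytes, b: bytes) -> list:
    """Return (byte_offset, bit_index) for every bit that differs between a and b."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    flips = []
    for offset, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y  # set bits mark the positions where the bytes disagree
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit))
    return flips


# Illustrative pairs only: expected text vs. what was observed.
for good, bad in ((b"-", b","), (b"printf", b"printg")):
    print(good, bad, differing_bits(good, bad))
```

Both pairs report a single differing bit (bit 0 of one byte), which is
what a one-bit memory or bus error would produce.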

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Cyril Hrubis
Date: Monday, March 8, 2010 - 1:25 am

Well, I've seen empty lines (the ones starting with a blue tilde) in the
middle of a file when editing it with vim. And sometimes programs segfault
for no good reason. Just today I ran "apt-get update" and got:

symbol lookup error: apt-get: undefined symbol: _ZN16pkgAcquireStatus4StopEv

While the correct symbol seems to be _ZN16pkgAcquireStatus4.

When running 'make' in the kernel directory and closing the display, the
machine sometimes dies and nothing but the reset under the battery cover
helps. I remember waking up in the morning, opening the device and resetting
it. And it seems to be provoked much more by an active CF wifi card.

-- 
metan
--

From: Stanislav Brabec
Date: Monday, March 8, 2010 - 4:48 am

And I have seen:

- Unreproducible SIGSEGV of gcc (while Wi-Fi connection over CF card was
  running).
- Unreproducible SIGSEGV of opkg (downloading via Wi-Fi connection over
  CF card).
- Unreproducible SIGSEGV of rm (called from find command launched via
  ssh, networking via Wi-Fi connection over CF card).
  (Hint: Tasks above are HDD-intensive.)
- Lost blocks while copying from CF to SD.
- Lost blocks while copying from HDD to SD.
- Lost blocks while copying from CF to USB flash stick.
- And I see display noise while CF Wi-Fi card is active.

These problems appear in all kernels, at least since 2.6.26.

There is no note in the syslog.
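
Since nothing is logged, lost blocks can only be found by comparing the
copy against its source. A minimal sketch, assuming both files are still
readable; the helper name mismatched_blocks and the 512-byte block size
are assumptions chosen for illustration:

```python
def mismatched_blocks(src_path, dst_path, block_size=512):
    """Return the indices of blocks that differ between two files.

    A block missing from one file (because the files differ in length)
    is also reported as a mismatch.
    """
    bad = []
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        index = 0
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:  # both files exhausted
                break
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

Calling mismatched_blocks("/path/on/cf/file", "/path/on/sd/file") (paths
are placeholders) returns an empty list when the copy is intact. Note
that a read-back right after the copy may be served from the page cache
rather than from the card, so the comparison is more meaningful after
the caches have been dropped or the media remounted.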


________________________________________________________________________
Stanislav Brabec
http://www.penguin.cz/~utx/zaurus

--

From: Cyril Hrubis
Date: Monday, March 8, 2010 - 5:16 am

I forgot about this one. See for yourself; notice the short black vertical
lines flashing randomly.

http://atrey.karlin.mff.cuni.cz/~metan/outgoing/zaurus_sickness.mpg

-- 
metan
--

From: Russell King - ARM Linux
Date: Monday, March 8, 2010 - 5:42 am

I haven't looked at the video.

Is this display rotated by 90 degrees?

If so, they're actually horizontal lines as far as the display scanning
is concerned - and that tends to suggest that there's insufficient system
bus bandwidth for all the activity taking place, and the LCD controller
is being starved of data.

I've seen similar (described) effects on SA1110 systems in past years
with low clock rates.

Some of the reports suggest that this happens with multiple kernel versions
and is not something new to the latest kernels.  Please confirm when the
problem started.
--

From: Cyril Hrubis
Date: Monday, March 8, 2010 - 6:32 am

Well, when doing 'echo 0 > /sys/class/graphics/fbcon/rotate_all' for 2.6.33
I've got the same problem but the lines are vertical. However 2.6.24 seems to

As far as I can test, in rotated mode it happens for kernels from 2.6.24 to
2.6.33 (I don't have a kernel older than 2.6.24 that boots on spitz).

-- 
metan
--

From: Marek Vasut
Date: Wednesday, June 2, 2010 - 5:01 pm

This is not limited to spitz. I've seen the LCD image falling apart with pxafb
on a Voipac PXA270 board. The image looked like it was "torn in half and part
of it was moved to the right, the hole between staying white".

This happened exactly when I started doing a DMA transfer from a harddrive 
attached through pata_pxa. It's perfectly replicable. If I disabled DMA and let 
it run only in PIO, the image was fine.

I assume the corruption Pavel was seeing is related. My guess is the problems 
are caused when DMA between the CPU and a companion chip happens. I dunno if the 
DMA controller doesn't have enough power to supply LCD and the companion chip 
with data, but that's one of my guesses.

btw. Adjusting the DMA descriptor length in pata_pxa didn't help.

Guys, we need to investigate this as it seems to cause problems in many places.

Cheers!
--

From: Eric Miao
Date: Wednesday, June 2, 2010 - 7:30 pm

If there is a FIFO attached, we may need the FIFO status at the time the error
happens. And a dump of the DMA registers would also be helpful.
--

From: Pavel Machek
Date: Sunday, March 21, 2010 - 1:40 pm

Interesting, I get memory corruption leading to strange
behaviour. Sometimes echo 3 > /proc/sys/vm/drop_caches helps...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andy Green
Date: Thursday, March 11, 2010 - 6:25 am

I saw very similar failures for a long time on our iMX31 based device. 
Eventually I found a Freescale errata where the RAM inside the USB2 
macrocell started to make single bit errors below 1.38V Vcore; ours was 
1.4V at that time but dipped on CPU load.

I cranked up the Vcore to 1.6V and that solved it, we also added some 
ceramic caps to Vcore to help with the dips.

So it might be worth looking at the PMU arrangements for the Vcore level /
looking for dips with a 'scope (even though this isn't an iMX31).

A characteristic of it was it never caused kernel issues, since the 
kernel didn't come over USB.  It only ever caused troubles on userspace 
stuff.

-Andy

--

From: Stanislav Brabec
Date: Thursday, March 11, 2010 - 8:42 am

Good tip. It seems that nobody ported the driver for the voltage control
chip ISL6271 from the 2.4 kernel, and the bootloader probably does not set
the correct values.

Datasheet:
http://www.penguin.cz/~utx/zaurus/datasheets/power/Xscale/ISL6271.pdf

-- 
Stanislav Brabec
http://www.penguin.cz/~utx/zaurus

--

From: Andy Green
Date: Thursday, March 11, 2010 - 2:21 pm

Unless there's more to it in the way the Zaurus uses it, that regulator
isn't programmable digitally.

Reading about your CF card WLAN issues: those cards suck down a good
amount of power when their radio is up, so I would definitely suggest
monitoring the various rails (Vcore, RAM and whatever it is the CF card
is powered by) with a 'scope while putting it under load.

-Andy
--

From: Stanislav Brabec
Date: Friday, March 12, 2010 - 2:07 am

OOPS, I made a mistake and linked ISL6721 instead of ISL6271 there.
Now it is fixed:
http://www.penguin.cz/~utx/zaurus/datasheets/power/XScale/ISL6271A.pdf

This one has I2C. It is connected to GPIO 3 (PWR_SCL) and GPIO 4
(PWR_SDA).

It is visible between the black plastic and the large circular coil:

I guess that Zaurus has a good power design and that voltage should be
constant enough. CF has a dedicated step down (plus 2.8V power detector
(Why so low, if the CF standard requires more?)), HDD has a dedicated step
up/down. USB has dedicated step up. Companion chips use dedicated 3.3V
step down. Audio uses dedicated linear regulator. CPU has several
dedicated step downs, CPU 3.3V uses step-up to 5V and then down to 3.3V
(which is shared only with IOPORT).

Nearest common point between CF card power and CPU power is the battery.


________________________________________________________________________

Stanislav Brabec
http://www.penguin.cz/~utx/zaurus


--

From: Andy Green
Date: Friday, March 12, 2010 - 2:33 am

Thanks... that defaults to 1.3V on Vcore if you don't touch it.  I guess 

In that case is the PXA CF driver PIO?  Then it can be the same load on 
Vcore issue in disguise.

-Andy
--

From: Stanislav Brabec
Date: Friday, March 12, 2010 - 3:43 am

There is a proprietary ASIC chip (Sharp Scoop) that handles CF and HDD
access (and also several additional GPIOs):
http://www.penguin.cz/~utx/zaurus/datasheets/ASIC_S1L50752B26B200/412752.PDF

The ASIC runs in dual power mode. HVDD is powered from the 3.3V
dedicated to CF resp. HDD power supply (both may be turned off by the
kernel), LVDD is shared with CPU 3.3V (it is always on).

It seems that there are no other chips connected to the VCC_PLL,
VCC_SRAM and VCC_CORE.

VCC_DRAM is the same 3.3V as CPU and ASIC LVDD, and also the same as
flash power and the flash driver CPLD:
http://www.penguin.cz/~utx/zaurus/datasheets/memory/


________________________________________________________________________
Stanislav Brabec
http://www.penguin.cz/~utx/zaurus

--

From: Andy Green
Date: Friday, March 12, 2010 - 4:13 am

Right, but I wasn't thinking about its power arrangements, rather about
the load on the CPU itself when it's transferring data to / from the CF
interface (via this ASIC).

If the ASIC has bus master DMA and that's used by the driver then fair
enough. Otherwise, if it is done by PIO in the driver, "while using CF"
(as mentioned in most symptoms) becomes the same as saying "during 100%
load on CPU", which is what leads to dents in Vcore and potential
instability by that same Vcore path.

-Andy
--

From: Pavel Machek
Date: Sunday, March 21, 2010 - 1:43 pm

Are we sure about this one? If we have wrong voltages on various
parts, that kind of explains it.

Would it be possible to measure (with a voltmeter) the difference
between the 2.4 kernel and the 2.6 kernel?

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Stanislav Brabec
Date: Sunday, March 21, 2010 - 2:42 pm

If you are ready to run Zaurus in dismantled state, then yes. Measure on
the upper pin of the large coil in the center of the
http://www.penguin.cz/~utx/zaurus/pcbt_uc.jpg image or on the testpoint
nearby (probably to the right).

Alternatively, it is possible to write a driver. It is just one byte
write and one byte read via I2C.


________________________________________________________________________
Stanislav Brabec
http://www.penguin.cz/~utx/zaurus

--

From: Pavel Machek
Date: Monday, March 29, 2010 - 11:27 am

Do you know what byte it is? That sounds easy enough...

But I have a small problem now -- the zaurus seems to work mostly fine
now. Does it depend on temperature, or what? Tried mtest,
nothing. Tried compiling a kernel, ok... Will try a few more times...

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Stanislav Brabec
Date: Monday, March 29, 2010 - 11:52 am

Yes, it should be an easy driver. One byte of address and then a one-byte
write or a one-byte read.

See the datasheet:
http://www.penguin.cz/~utx/zaurus/datasheets/power/XScale/ISL6271A.pdf

Page 11: address
Pages 8 and 9: Data interpretation

-- 
Stanislav Brabec
http://www.penguin.cz/~utx/zaurus

--
