
GWYN FISHER, CTO | WHITE PAPER | SEPTEMBER 2010

Developing Software in a
Multicore & Multiprocessor World
A tool-based approach for finding complex concurrency issues
and endian incompatibilities

To keep pace with customer demands for more functionality and speed,
software teams are moving away from single processor architectures at a rapid
rate. In particular, embedded devices that used to have one chip to perform
a constrained set of tasks are now working in heterogeneous processor
environments where processors are used for network connectivity, multi-media,
and a whole variety of requirements. According to new data from VDC Research,
this trend is only expected to accelerate: engineers expect that in two years' time,
the number of single processor projects will drop by half.

The business impact of this growing complexity is stark: multicore and multiprocessor software
projects are 4.5X more expensive, have 25% longer schedules, and require almost 3X as many
software engineers.1
One area where this growing complexity can have a dramatic impact on cost and schedule overruns is
software testing and code inspection. A multicore/multiprocessor environment can add exponential
complexity to the task of effectively identifying errors in software. Two classes of problem in particular
have the ability to drag the productivity of a software team through the floor: concurrency errors
and endian incompatibilities.
This white paper will discuss these types of issues in detail, explain how Klocwork's source code
analysis engine, Klocwork Truepath, can be used to address them, and walk through two examples
of these problems in prominent open source projects.
Figure 1 | Processing Architecture Used in the Current Project and Expected in Next Two Years (Percent of Respondents)

                                  Current Project    Expected in 2 Years
    Single processor                        61.8%                  30.1%
    Multiprocessor                          20.8%                  20.6%
    Multicore                                9.3%                  21.4%
    Multicore and multiprocessor             5.2%                  19.4%
    Don't know                               2.9%                   8.5%

1 VDC Research, Next Generation Embedded Hardware Architectures: Driving Onset of Project Delays, Costs Overruns, and Software Development Challenges, September 2010.

Tackling Concurrency Issues and Endian Incompatibilities with Klocwork Truepath


Concurrency Issues
Source code analysis is a process by which the expected, or predicted behavior of a program at runtime is exercised, along
every conceivable control flow path, in order that aberrant situations be found, diagnosed, and described to the author in
such a way as to make them simple to fix. In the typical course of events, no timing information or order, other than that
inherent in the control flow graph, is interpreted or required for this analysis to take place.
Concurrency issues pose a complex set of challenges for analysis, as they do require timing or ordering information to be
promoted into the control flow graph. Some are obviously less difficult to find than others, such as threads that reserve locks
and perform time-consuming activities before releasing. This type of behavior, whilst not leading to a critical failure such as a
deadlock, can lead to frustration on the part of the end user of the software, for example in the face of an unresponsive device.
The more complex type of concurrency issues, such as deadlocks, require an additional type of analysis over-and-above that
performed when finding non-order-related bugs such as memory leaks or buffer overruns. In this case, we must perform two
different types of analysis: one that gathers and propagates lock lifecycle behavior, and another that can analyze the whole
program space and find conflicts in this behavior.
Klocwork Truepath makes this possible via the addition of a new concurrency analysis engine to its existing tool chain:

Figure 2 | The Klocwork Truepath tool chain provides the concurrency analysis engine after build emulation and control flow graph analysis:

    Compile:          emulate native build
    Symbolic logic:   build control flow graph; analyze control flow graph; perform dataflow analysis
    Concurrency:      analyze lock dependencies

In this figure you can see that data relating to lock lifecycles is gathered by the normal analysis engine, and once this has been
produced for all modules in the system, the whole program space is then analyzed by the new concurrency analysis engine so
that loops in the lifecycle graph can be found, which equate to deadlocks.
Consider a function that operates as follows:
lock_t Lock1, Lock2;

void foo(int x) {
    if( x & 1 ) {
        lock(Lock1);
        lock(Lock2);
    }
    else
        lock(Lock1);
}

You can easily see by inspection that when passed an odd number as its parameter, this function defines a dependency of
Lock2 upon Lock1. With an even parameter, Lock1 is still reserved, but this time there is no dependency of Lock2 upon
Lock1 at the local scope, although that dependency (or another) may still exist at an inter-procedural scope.
Therefore, we have two discrete types of questions to ask when performing the analysis:
1. Symbolic logic questions:
a. Is there a valid control flow that gets us to call function foo() with an odd parameter?
b. Is there a valid control flow that results in foo() being called with an even parameter followed by a call to
another function that results in another lock (e.g. Lock2) being reserved before Lock1 is released?
2. Lock dependency questions:
a. If either of these is so, is there any other situation in the program's natural control flow whereby a counter-dependency of Lock1 upon Lock2 can be reached, potentially resulting in a deadlock?

Developing Software in a Multicore & Multiprocessor World | Klocwork White Paper | 2

The first type of question is answered by Klocwork Truepath's symbolic logic engine during the normal course of program
analysis, just as any other type of defect is analyzed for inter-procedural data flows that can or cannot occur.
The second type of question is then answered by the concurrency analysis engine, fed by the collection of all possible
dependencies within the program space. The result is what tends to be a small set of incredibly difficult to find (manually), and
insanely difficult to understand (without a tool) deadlock scenarios that developers can triage and fix very quickly within the
natural course of their implementation tasks.

Endian Incompatibilities
Whilst it may be true that there are 10 kinds of people in the world, a switch from a little endian platform to a big endian platform
will muddy that impression considerably. An advisor of ours recently informed me with glee that he'd finally set his MSB (having
passed his 64th birthday), but store that in nibble representation on an unexpected endian architecture and he'd be regressing
to the nursery once more.
In short, endian representations affect how the host processor stores integral types in memory. Considering 32-bit integers,
each of which consists of four bytes of memory, the processor can choose to read and write those four bytes in a variety of
orders, although traditionally only two are used:

- Little endian, in which the bytes are written in the order 0, 1, 2, 3
- Big endian, in which the bytes are written in the order 3, 2, 1, 0

This picture becomes slightly muddied if the processor actually writes words at a time (this is mostly a fairly historical
representation now, but we mention it for completeness), and applies its endian assumptions to each word:

- Little endian still writes bytes in the order 0, 1, 2, 3
- Big endian, however, may now write bytes in the order 1, 0, 3, 2

However the processor stores and reads such types is entirely at its own discretion and the business of nobody else. Until,
that is, the developer directs the processor to write such data into a medium for transmission, as opposed to storage in
memory.
Transmission media, which could be sockets, files, pipes, or any other inter-processor vector (e.g. interrupts that cause data
to be written to the PCI-Express interface, or to the serial bus, or …), are addressed by the processor in exactly the same way
as memory unless specifically told to do otherwise.
Thus, a big endian processor will write a 32-bit integer onto a socket in byte order 3, 2, 1, 0. If the CPU on the other end
of the socket uses a little endian architecture, then obviously a value written onto the socket will be interpreted completely
differently when read. For example, a 32-bit value of 29, written by a big endian processor and read by a little endian processor,
will be interpreted as 486,539,264, which is not a small correction by any means.
Preparing a program for use with heterogeneous processor architectures therefore involves finding every integral type that
ever hits a transmission vector that could legitimately target another processor, and ensuring that the read/write operation
involved transforms the data into/from a neutral representation that both sides agree on. In a program of any size,
this is obviously a non-trivial task.
Klocwork Truepath can help developers in this task as it now includes the ability to validate type representation usage
symmetrically as those types cross transmission vector boundaries. That is, the data flow engine within Klocwork Truepath
automatically validates that types that are written directly to a transmission vector are subject to host-to-neutral format
transformation before the write operation takes place. Likewise, integral types read from a transmission vector are tracked to
ensure that they are appropriately transformed prior to the first attempted usage on the host.
For example, consider the following function:
void foo(int sock)
{
    int x;

    for( x = 0; x < 256; x++ )
        if( send(sock, &x, sizeof(int), 0) < (ssize_t)sizeof(int) )
            return;
}

This simple function makes the basic assumption that the reader on the other end of its socket has the same processor
architecture as the sender. This might be true, or more accurately it might be true today, but what designer can ever look far
enough into the future to know that it will always be true, regardless of market shifts, great ideas that marketing interns have,
etc.
Klocwork Truepath, upon analysis of this function, will point out:
Value x is used in host byte order, but should be used in environment/network byte order.
A developer versed in inter-architectural development will naturally modify this function to transform the value of the variable
x prior to transmission:
void foo(int sock)
{
    int x, xt;

    for( x = 0; x < 256; x++ )
    {
        xt = htonl(x); // or some other suitable form
        if( send(sock, &xt, sizeof(int), 0) < (ssize_t)sizeof(int) )
            return;
    }
}

Likewise when it comes to reading information across a transmission vector, Klocwork Truepath traces the data flow of any
received integral types to ensure, in exactly the opposite way to sending, that any such values are transformed to host format
prior to their first usage.
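A receive-side counterpart to the earlier sketch might look as follows. The recv_u32() helper name and its all-or-nothing error handling are illustrative assumptions, not anything prescribed by the tool; the point is that the ntohl() transformation happens before the value's first use on the host:

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>   /* ntohl() */

/* Read one 32-bit value from the socket and convert it from network
   order to host order before handing it to the caller. */
int recv_u32(int sock, uint32_t *out)
{
    uint32_t wire;

    if (recv(sock, &wire, sizeof(wire), MSG_WAITALL) != (ssize_t)sizeof(wire))
        return -1;   /* short read or error: no partial values escape */

    *out = ntohl(wire);   /* network order -> host order, before first use */
    return 0;
}
```

Pairing every raw read with an immediate host-order conversion like this is exactly the symmetric discipline the data flow engine validates.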

Open Source Case Studies


Lock Contention: SQLite ca. 2006
Long since addressed by the developers of this great open source project, a deadlock was reported in the execution of the
database engine and was traced to code that was specifically intended to guard against such an occurrence (as is usually the
case). Although the bug was complicated to understand, and the eventual fix amounted to an almost total rewrite of the offending
module, and although finding it without a tool such as Klocwork Insight would have required days or perhaps weeks of intense
manual debugging and thought-modeling, this very nasty bug was found and correctly described by Klocwork Truepath during
an analysis that took mere minutes.
Consider the requirement to implement a simplistic singleton recursive lock capability within an environment that doesn't
support such constructs. Using reference counting, we can quite simply guard the underlying non-recursive lock and manage
its lifecycle appropriately. Of course, this being a parallel world, we need to use another lock to guard the reference count that
we're using to guard the real lock, making the implementation just a bit more complicated.
The design of this might look something like the following example:
lock_t lock1, lock2;
int refCount = 0;

void enter() {
    reserve_lock(lock1);
    if( refCount == 0 )
        reserve_lock(lock2);
    release_lock(lock1);
    refCount++;
}

void leave() {
    reserve_lock(lock1);
    refCount--;
    if( refCount == 0 )
        release_lock(lock2);
    release_lock(lock1);
}


Now I can call enter() multiple times, simulating some of the capabilities of a true recursive lock, and as long as I remember to
call leave() an equal number of times the lifecycle of the underlying non-recursive lock is managed correctly:
void foo()
{
    // real lock is reserved
    enter();

    if( i-really-want-to )
    {
        // only the reference count is affected
        enter();
        leave();
    }

    // now the real lock is released
    leave();
}

Now consider the requirement to implement an abstraction over thread-specific data storage. To ensure safety when
allocating such a structure, the database engine uses the singleton recursive lock described above to protect its activities with
an implementation that simplifies as follows:
int tlsCreated = 0;

data_t* create_data()
{
    static data_t* tls;

    enter();
    if( tlsCreated == 0 )
    {
        tls = create_thread_data();
        tlsCreated = 1;
    }
    leave();

    init_data(tls);
    return tls;
}

On simple inspection, this appears quite correct: it calls leave() the same number of times as enter() and thus should be
considered well behaved. Unfortunately, life in the parallel world is rarely simple to analyze, and this case is certainly more
complicated than it first appears.
Consider a two-core CPU executing two threads, both calling create_data() at very slight offsets in time.
The first thread (let's call our threads Thread 1 and Thread 2) begins executing create_data() and successfully calls the
enter() function. This results in the underlying lock, lock 2, being reserved to Thread 1:
Thread 1
    create_data()
    enter()
        refCount = 0
        reserve(lock1)
        reserve(lock2)
        release(lock1)
        refCount = 1


Now let's assume that Thread 2 begins its execution of create_data() during the time that Thread 1 is active, and before it
releases lock 1:
Thread 1                            Thread 2
    create_data()
    enter()
        refCount = 0
        reserve(lock1)
        reserve(lock2)
                                        create_data()
                                        enter()
        release(lock1)                      reserve(lock1)

One further assumption makes the scenario whole: Thread 1 at this moment is interrupted by the operating system, losing its
time on chip. Crucially, this happens before the reference count is updated. (Check the implementation of enter() and you'll
see that the author unfortunately left the reference count update outside of the lock that is supposed to guard access to it.) As the
reference count will therefore still read zero for Thread 2, it will attempt to reserve lock 2, resulting in Thread 2 blocking (as
lock 2 is already owned by Thread 1):
Thread 1                            Thread 2
    create_data()
    enter()
        refCount = 0
        reserve(lock1)
        reserve(lock2)
                                        create_data()
                                        enter()
        release(lock1)                      reserve(lock1)
        interrupted
                                            refCount = 0
                                            reserve(lock2)
                                            blocked

Upon return from interrupt, Thread 1 is released and resumes execution where it left off, incrementing the reference count
and returning from the enter() function. Its execution of create_data() continues, leading to a call to the leave() function, which
unfortunately attempts to reserve lock 1 before doing anything else:
Thread 1                            Thread 2
    create_data()
    enter()
        refCount = 0
        reserve(lock1)
        reserve(lock2)
                                        create_data()
                                        enter()
        release(lock1)                      reserve(lock1)
        interrupted
                                            refCount = 0
                                            reserve(lock2)
                                            blocked
        refCount = 1
        return
    leave()
        reserve(lock1)
        blocked

Because Thread 2, which is currently blocked waiting on lock 2, still owns lock 1, Thread 1 will now block on its
own attempt to reserve lock 1.
In short, this is a classic lock-order inversion caused by a poorly guarded data item, which, when subject to a race
condition (being read by one thread whilst in the process of being updated by another), causes one thread to reserve locks in
order while the other thread attempts to reserve them out of order, resulting in a deadlock.
With the race condition fixed, this singleton will operate correctly, although, as previously described, the author actually chose
to completely rewrite this module, providing a more useful re-entrant mutual exclusion capability for multiple threads, i.e.
removing the singleton semantic.
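A minimal sketch of the race-condition fix, mapped onto POSIX threads for concreteness (the original SQLite code uses its own locking primitives, so the pthread calls here are an assumption): moving the reference count update inside the lock 1 guard means no second thread can ever observe refCount == 0 while lock 2 is already owned.

```c
#include <pthread.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
static int refCount = 0;

void enter(void) {
    pthread_mutex_lock(&lock1);
    if (refCount == 0)
        pthread_mutex_lock(&lock2);
    refCount++;                      /* now updated while lock1 is held */
    pthread_mutex_unlock(&lock1);
}

void leave(void) {
    pthread_mutex_lock(&lock1);
    refCount--;
    if (refCount == 0)
        pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
}
```

With this ordering, reads and writes of refCount are always serialized under lock 1, so the interleaving described above can no longer reach the state where Thread 2 reserves lock 2 while Thread 1 owns it.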


Figure 3 | Source listing from SQLite

Figure 4 | Control flow description from Klocwork Truepath


Endian Design Assumptions: PostgreSQL


In contrast to the situation described in relation to SQLite, the findings in this case study don't point to bugs in software as
much as they do to limited design decisions and the impact they have on how software is then constructed.
Specifically, when designing a multi-process application, the architect is faced with the fundamental decision of whether all of
those processes are going to be supported on one chip, or whether for the sake of scale or pure flexibility, the software will
support being deployed and executed on multiple chips / hosts / devices at once.
In the case of PostgreSQL, one of the processes detached from the main kernel is the statistics collector, something that
acts more or less as a performance monitor, allowing the DBA to understand what's going on within the kernel without
necessarily impacting the performance of the kernel whilst running reports or monitors against those statistics. This provides a
nice analog for a typical application-layer process set that needs to interact with each other, but which by design could be
implemented to operate on either the same CPU/host or a completely different one.
To implement this low-touch collection and reporting mechanism, the PostgreSQL designers chose to fork() a process,
presumably on the same CPU or multi-CPU package, and then use an asynchronous socket to transmit data from the
kernel process to the collector. Using the pgstat application, the DBA can then interact with whatever the child process has
collected at any point in time.
All of this is encoded within the module src/backend/postmaster/pgstat.c.
Because of the way that this fundamental decision was taken in this particular case, the designer chose to encode data
transmission between the kernel and the collector using host-native representation. For example:

Figure 5 | Data representation analysis in action

In this example, it's simple to see the assumption in all its glory, as the data member msg.msg_hdr.m_size is read and used
directly off the wire, in what could be, but isn't in this case, network order.
Now let's assume that a new generation of designers revisits this decision and instead places emphasis on scale and flexibility
over ease of implementation. They decide to place the statistics collector process on an arbitrary node in the hardware
design, rather than on the same node as the kernel process.
With this decision in place, the assumption that network byte order and host byte order are the same can no longer be made
in general. Porting to this new assumption set could take significant time, both for developers and for the test crew, faced with
putting together a matrix of CPUs / hosts that embody the plethora of representations we can expect to support in the field.
Using a tool-driven approach, however, this entire effort can be collapsed into a single analysis pass, taking minutes in total, to see
a report of what's involved. In this case, the designers would be faced with the following endian vulnerabilities that would need to
be addressed (along with the obvious logistical issues around how to place the process on the right host/CPU, of course):
pgstats.c: line 1988: function pgstat_recvbuffer()
Value msg.msg_hdr.m_size is used in network order.
pgstats.c: line 1443: function pgstat_send()
Value *msg is used in host byte order.
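The shape of the fix follows the same symmetric discipline shown earlier: convert each header field to network order immediately before transmission, and back to host order immediately after receipt. The struct below is a simplified stand-in for the real PostgreSQL message header, and the two helper functions are illustrative assumptions, not code from pgstat.c:

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htonl(), ntohl() */

/* Simplified stand-in for PostgreSQL's statistics message header;
   only m_size is named in the tool report above, m_type is assumed. */
typedef struct {
    int32_t m_type;
    int32_t m_size;
} StatMsgHdr;

/* Host order -> network order, applied just before send(). */
static void hdr_to_wire(StatMsgHdr *h) {
    h->m_type = (int32_t)htonl((uint32_t)h->m_type);
    h->m_size = (int32_t)htonl((uint32_t)h->m_size);
}

/* Network order -> host order, applied just after recv(). */
static void hdr_from_wire(StatMsgHdr *h) {
    h->m_type = (int32_t)ntohl((uint32_t)h->m_type);
    h->m_size = (int32_t)ntohl((uint32_t)h->m_size);
}
```

On a big endian host both conversions are identities, so the change costs nothing where the original single-host assumption still holds.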


These two simple issues might be thought of as the whole problem domain. However, looking further into what this module is
capable of, certain information can be persisted across sessions using a statistics file. If we further our decision to allow the
process to be spawned on heterogeneous hardware, we might well continue that spread by allowing different instantiations of
said process to occur on heterogeneous hardware, thus requiring persistent data to be endian safe:

pgstats.c: line 2556: function pgstat_read_statsfile()

Value format_id is used in environment byte order.

Similar errors can be found on line(s): 2610, 2684, 2717, 2740.

pgstats.c: line 2312: function pgstat_write_statsfile()

Value format_id is used in host byte order.

Similar errors can be found on line(s): 2351, 2384, 2411, 2412.
Armed with this information, the designer can make all required updates to remove endian vulnerability from their code in
one pass.

Conclusion
The complexity of this problem domain is vast, so there's no one solution, tool, or approach that will address all your problems.
Development teams need to equip themselves with good tools, smart design assumptions, and even smarter developers to
reconcile the feature race demanded by the market with the underlying platform complexity it implies. When it comes
to selecting a tool, source code analysis should be on your shortlist, as it offers a compelling mix of scalability, flexibility, and the
ability to address a broad set of issues that will help you to ensure the overall security and reliability of your code.

About the Author


Gwyn Fisher is the CTO of Klocwork and is responsible for guiding the company's technical direction and strategy. With nearly
20 years of global technology experience, Gwyn brings a valuable combination of vision, experience, and direct insight into
the developer perspective. With a background in formal grammars and computational linguistics, Gwyn has spent much
of his career working in the search and natural language domains, holding senior executive positions with companies like
Hummingbird, Fulcrum Technologies, PC DOCS and LumaPath. At Klocwork, Gwyn has returned to his original passion,
compiler theory, and is leveraging his experience and knowledge of the developer mindset to move the practical domain of
static analysis to the next level.

About Klocwork
Klocwork helps developers create more secure and reliable software. Our tools analyze source code on-the-fly, simplify
peer code reviews, and extend the life of complex software. Over 900 customers, including the biggest brands in the mobile
device, consumer electronics, medical technologies, telecom, military and aerospace sectors, have made Klocwork part of
their software development process. Thousands of software developers, architects, and development managers rely on our
tools every day to improve their productivity while creating better software.

IN THE UNITED STATES:


15 New England Executive Park
Burlington, MA 01803

IN CANADA:
30 Edgewater Street, Suite 114
Ottawa, ON K2L 1V8

t: 1.866.556.2967
f: 613.836.9088
www.klocwork.com

© Klocwork Inc. All rights reserved. Klocwork and Klocwork Truepath are registered trademarks of Klocwork Inc.
