16

From documentation, open-source code bases, and a general sense of the industry, I have noticed that C# developers largely prefer List<> or IEnumerable<> over simple array[] types, and I'm wondering if my preference for arrays is misguided.

For some background, most of the time, I'm writing back-end business software; the kind of API or service that reads data from some data source, and sends back objects of that data to the caller. Most of that time, I'm doing a query from an SQL database or similar object store where the methods of access are pretty much distinct CRUD operations. There isn't much processing or derivation of data (like map & reduce or similar) going on, kind of ever. For example, I might write a method that does "search the product index and return matching products". Or "get me customer's orders between two dates".

These kinds of access patterns do not imply any need for modification; middle-tier code has no business removing or adding items to the collection returned. Its job might be to enrich the data or convert it to another format, generally only passing it through, or maybe iterating over it for some purpose. Generally, though, the size of these arrays never changes.

I was always taught that when designing a software component, you want to keep its context to the minimum viable amount of information. For example, you pass an individual value rather than a whole object to a function that only needs one value from the object. In the case of an API that does not imply mutability, why should I want to return a type like List that implies the list is mutable?

Why then do so many developers (and API designers) seem to prefer using List<T> in returned data sets? I understand List is implemented on top of arrays and adds some convenience, but the array notation is built into the language and is syntactically simpler (and arguably more universally understood across other languages). LINQ adds most of the functionality for projection, searching, sorting, etc. that arrays do not contain by default, and is more or less interoperable with List<T>. Is this a holdover from the .NET Framework 1.1 dark ages of ArrayList, where array has become a bad word? Is List<T> somehow more performant than array[] despite the overhead? Is implying mutability beneficial vs. just having the caller do .ToList() if they really need to edit the data?


Update:

I want to thank everyone for taking the time to read this and for all the thoughtful replies; it has struck up quite the conversation, which is awesome. This question was closed. Then reopened. Answered numerous times, with lots of comments and different ideas. I suppose I was hoping for something simple, some benchmark or pointer that one thing is definitively better than another, which doesn't seem to exist. The resounding opinion seems to be: just use the thing that makes the most sense for how the data is to be used. I do feel better that my cause for concern is justified, given the wide breadth of different angles and opinions answering this. The fact that C# has a bajillion different ways to represent "a related pile of things of the same kind" just happens to be both a curse and a blessing for the language and the framework it runs on.

K0D4

10 Answers

28

I can't speak for the industry, but I'd like to point out that the "minimum viable amount of information" is represented by an unmodifiable iterator (in the sense that it can't modify the underlying data), not an array.

An array can be iterated over multiple times, it can be modified, and its size is known at all times. That is a very rich contract, swimming in extraneous requirements and information.

On the other hand IEnumerable leaks much less information and IEnumerator leaks even less.

An API design choice is obvious - returning an array would leak too much information, return IEnumerator instead. It would be nice to return an interface without a Reset method, but we can't ask for too much. At least it does not have a remove() method like Java's Iterator!

Basilevs
26

Two reasons.

An array, in C#, is a weird and very special data structure: its length is immutable, but its elements are mutable. Such specificity is rarely needed. Actually, in twenty years of software development, I haven't yet seen a single case in C# where the author of a piece of code was actively seeking this specificity. I've seen a lot of cases where arrays were used wrongly, by someone who didn't know what arrays are, nor what the other types are, such as hash sets, collections, lists, queues, or doubly linked lists.

Also, an array is a bit low level for general use—it could make sense in C++, but less in C#. A List<T> is based on an array, and provides a good abstraction level over the actual way elements are stored in an array, or what happens when a list grows beyond the initial capacity of the underlying array. You may intentionally seek the low level abstraction provided by an array, but that shouldn't be your default choice by any means. And if you actually want to do low level stuff, Span<T> may possibly be what you are looking for.

> For example, I might write a method that does "search the product index and return matching products". Or "get me customer's orders between two dates".

You won't need arrays here, or any type of collection.

> These kinds of access patterns do not imply any sort of modification need [...] the size of these arrays never changes.

If nothing changes, an array is a bad data type, as it implies that the elements are mutable.

> In the case of an API that does not imply mutability, why should I want to return a type like List that implies the list is mutable?

You shouldn't. But neither should you return an array, for the exact same reason you shouldn't use lists here.

> the array notation is built-in to the language

If you want to go the route of notations and built-ins, check collection expression syntax. Hint:

ICollection<int> d = [1, 2, 3];

has absolutely no arrays involved. It does involve lists, spans, and collections, however.

> and syntactically simpler

What about the actual usage?

Junior programmer creates an array x.

Junior programmer wants to add an element to x.

He types x.Add(42), and is immediately shouted at by the compiler.

He tries to understand the message, but it's not an easy task. All the examples he has seen use Add on collections. So why do those examples work, whereas in his code, Add doesn't exist?

> LINQ adds most of the functionality for projection, searching, sorting, etc that array does not contain by default

Nor does a list.

13

The array interface has several problems. Firstly, it is not as type safe as other collection types. In C# you expect type errors to be detected at compile time. Type errors at runtime usually only occur when using explicit casting which is well known to be risky. But updating an array can give you runtime type errors even though the code looks perfectly fine and does not use explicit casting.

The issue arises because the way C# handles array subtyping is unsound. If, for example, Button is a subtype of Control, then C# also considers Button[] a subtype of Control[]. But this is not a sound assumption! A fundamental tenet of subtyping is that any operation you can do on a type, you can also do on a subtype. Yet you can insert any control into a Control[], while you can only insert buttons into a Button[].

For example:

Button[] widgets = new Button[7];
Frizzle(widgets);

void Frizzle(Control[] widgets) { widgets[1] = new Textbox(); }

class Control {}
class Button : Control {}
class Textbox : Control {}

This code will compile fine, but fail with an ArrayTypeMismatchException at runtime even though there is no casting involved.

If you had used some of the other collection types or interfaces, the error would have been detected at compile time. The runtime error is unique to arrays.
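As a rough, runnable sketch of that contrast: the snippet below uses the built-in exception hierarchy as a stand-in for Control/Button/Textbox (ArgumentException and InvalidOperationException both derive from Exception), purely so that it needs no extra class declarations. The helper names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for Button[]: an array of a derived reference type.
ArgumentException[] buttons = new ArgumentException[2];

// Array covariance: this compiles, but the element store is checked at run time.
bool StoreThroughArray(Exception[] items)
{
    try
    {
        items[0] = new InvalidOperationException(); // not an ArgumentException
        return true;
    }
    catch (ArrayTypeMismatchException)
    {
        return false;
    }
}

// A read-only view is safely covariant: there is no setter to misuse.
int CountThroughReadOnlyView(IReadOnlyList<Exception> items) => items.Count;

bool stored = StoreThroughArray(buttons);
int count = CountThroughReadOnlyView(buttons);

// The List<T> equivalent is rejected by the compiler, because
// List<ArgumentException> is not convertible to List<Exception>:
// void StoreThroughList(List<Exception> items) { items[0] = new InvalidOperationException(); }
// StoreThroughList(new List<ArgumentException>());

Console.WriteLine($"stored={stored}, count={count}"); // stored=False, count=2
```

The write through the array is only caught at run time, while the commented-out List<T> version would not get past the compiler.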

Secondly, arrays are a weird combination of mutable and immutable: you can replace elements, but you can't add or remove elements. This is almost never a useful contract. Either you want a collection to be fully immutable or you want it to be fully mutable.

You usually want one of:

  • IEnumerable<T> - immutable, sequential access
  • IReadOnlyList<T> - immutable, random access to elements
  • IList<T> - mutable

Arrays are often fine as the implementation exposed as an IEnumerable<T> or IReadOnlyList<T>, since those interfaces protect against the type-safety issue. But there is an additional issue with exposing an array as IList<T>: arrays actually implement this interface, but will throw a runtime exception if you attempt to add or remove elements. So again it gives you run-time errors instead of compile-time errors.
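A small sketch of that last point, assuming nothing beyond the base class library: the Add call compiles because IList<T> declares it, yet the array behind the interface refuses it at run time.

```csharp
using System;
using System.Collections.Generic;

IList<int> numbers = new[] { 1, 2, 3 };   // an array is accepted as IList<int>

numbers[0] = 10;                          // replacing an element works

bool addSupported;
try
{
    numbers.Add(4);                       // compiles, but arrays have a fixed size
    addSupported = true;
}
catch (NotSupportedException)
{
    addSupported = false;
}

Console.WriteLine($"first={numbers[0]}, addSupported={addSupported}"); // first=10, addSupported=False
```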

In short, arrays are not as type safe as other collection types.

JacquesB
7

I can't really speak to norms or opinions of the C# user community but at a more language agnostic level, I think there's a general trend away from using 'low-level' arrays when there's no specific benefit to using them, at least in what I will call 'high-level' languages.

I'm not completely sure that calling arrays 'low-level' is really even the right way to frame it. It's not necessarily the case that what looks like an array in a language is an array in the C sense. For example, in Python, foo = [] is a way to create a list, and the terms 'list' and 'array' are used interchangeably.

The real driver towards abstract types like lists is that they reduce coupling. For example, let's say you have a function/method that takes an array of values and does some sort of aggregate calculation on them. It works fine for your initial needs. Then, a few versions later, you want to use that existing calculation but with the values from a map/dictionary. In order to use the array-based function, you probably need to copy all the values into an array and then pass it to the function. OK. Big deal, right? A little extra memory. RAM is cheap, etc. But what if that input needs to come from a remote source such as a DB? That means allocating local memory, executing the full query, and only then passing it to the function. That takes time, assuming you have enough memory available. If you used an abstract type such as an ICollection<T> (or even better an IEnumerable<T>), you can run the calculation concurrently with the query. It's easy to wrap an array with an IEnumerable<T> interface but not easy to make non-arrays into arrays.
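To make the coupling argument concrete, here is a minimal sketch. The Average helper and the ReadPrices source are hypothetical names invented for this example; ReadPrices merely stands in for rows arriving from a query.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical aggregate: it only needs to walk the values once,
// so it asks for IEnumerable<double> rather than double[].
double Average(IEnumerable<double> values)
{
    double sum = 0;
    int count = 0;
    foreach (double v in values)
    {
        sum += v;
        count++;
    }
    return count == 0 ? 0 : sum / count;
}

// Lazy source standing in for rows streamed from a database.
IEnumerable<double> ReadPrices()
{
    yield return 1.0;
    yield return 2.0;
    yield return 3.0;
}

var byName = new Dictionary<string, double> { ["a"] = 4.0, ["b"] = 6.0 };

double fromArray = Average(new[] { 1.0, 2.0, 3.0 }); // arrays still work
double fromDictionary = Average(byName.Values);      // no copy into an array needed
double fromStream = Average(ReadPrices());           // consumed as it is produced

Console.WriteLine($"{fromArray} {fromDictionary} {fromStream}");
```

Had Average taken a double[], the dictionary and streaming callers would each have had to materialize an array first.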

That's just one example but using abstractions, especially for common things like lists and dictionaries makes it much easier to mix and match modular components and reduce how much you are converting things from one data structure into another.

JimmyJames supports Canada
6

There is no aversion against arrays, but arrays have limitations. They have a fixed size, and you need to know the size in advance when using them. But often, e.g., when reading records from a database, you do not know the number of items to be stored. Therefore, a list is very convenient as it grows automatically.

Of course, you could first read the data into a List<T> and then call .ToArray() to convert them into an array, but this is not very efficient. Lists also implement IReadOnlyList<T>. So, you could decide to return them as read-only through this interface.
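A brief sketch of that approach, with a hypothetical LoadNames method and an in-memory array standing in for the data reader:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical loader: builds a List<T> internally, exposes it read-only.
IReadOnlyList<string> LoadNames()
{
    var names = new List<string>();   // grows as rows arrive; size not known up front
    foreach (string row in new[] { "Ada", "Grace", "Linus" })   // stand-in for a data reader
    {
        names.Add(row);
    }
    return names;                     // no extra copy, unlike names.ToArray()
}

IReadOnlyList<string> result = LoadNames();
// result.Add("x");   // would not compile: IReadOnlyList<T> has no Add

Console.WriteLine($"{result.Count} names, first is {result[0]}");
```

Note that the interface only restricts what the caller sees; a determined caller could still cast the result back to List<string>.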

You can create arrays of more than one dimension. Therefore, arrays are often used when a matrix is required, or if you want to represent a checkerboard in a game, for instance. You can find matrix types in the System.Numerics namespace, but normal collections have no built-in multi-dimensionality. Well, you can create a collection of collections, but you are responsible for their management.

IEnumerable<T> is not a collection and it cannot store anything. It is a mechanism to enumerate items. So, it cannot directly be compared to arrays or lists (while they and most other collections implement it). But it offers an efficient way of returning data. Often you want to process this data immediately and do not need it to be stored anyway. Example: If you have a method querying and returning records from a database, you can use an iterator method which is directly streaming the data to the consumer.

// Assumes `using Microsoft.Data.SqlClient;` (or System.Data.SqlClient in older projects).
public IEnumerable<MyData> GetMyData(...)
{
    using var connection = new SqlConnection("...");
    using var command = new SqlCommand("SELECT [name], [value] FROM mydata WHERE ...", connection);
    connection.Open();
    using SqlDataReader reader = command.ExecuteReader();
    while (reader.Read())
    {
        // Each row is handed to the consumer as soon as it has been read.
        var data = new MyData {
            Name = reader.GetString(0),
            Value = reader.GetInt32(1)
        };
        yield return data;
    }
}
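The streaming behavior of an iterator method like the one above can be observed without a database. In this sketch, ReadRows is a made-up stand-in for the reader loop, and the produced counter exists only to show how many rows were actually generated.

```csharp
using System;
using System.Collections.Generic;

int produced = 0;

// Stand-in for a reader loop: yields up to 1000 "rows" on demand.
IEnumerable<int> ReadRows()
{
    for (int i = 1; i <= 1000; i++)
    {
        produced++;
        yield return i;
    }
}

var seen = new List<int>();
foreach (int row in ReadRows())
{
    seen.Add(row);
    if (seen.Count == 2)
    {
        break;   // the consumer stops early; the remaining rows are never produced
    }
}

Console.WriteLine($"consumed={seen.Count}, produced={produced}"); // consumed=2, produced=2
```

With a method that returned an array or a List<T>, all 1000 rows would have been created before the consumer saw the first one.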

5

I think at least some part of what you're seeing can be ascribed to the (now venerable) .NET Framework Design Guidelines, which were first published in 2005 very near the start of .NET's life. Microsoft's own libraries tend to stick pretty closely to these guidelines, which will affect a lot of what developers see and therefore have an effect on what they produce. The 2nd edition is online at https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/ ; relevant parts include:

https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/arrays

✔️ DO prefer using collections over arrays in public APIs. The Collections section provides details about how to choose between collections and arrays.

https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/guidelines-for-collections

❌ DO NOT use ArrayList or List<T> in public APIs.

These types are data structures designed to be used in internal implementation, not in public APIs.

And most pertinently:

Choosing Between Arrays and Collections

✔️ DO prefer collections over arrays.

Collections provide more control over contents, can evolve over time, and are more usable. In addition, using arrays for read-only scenarios is discouraged because the cost of cloning the array is prohibitive. Usability studies have shown that some developers feel more comfortable using collection-based APIs.

However, if you are developing low-level APIs, it might be better to use arrays for read-write scenarios. Arrays have a smaller memory footprint, which helps reduce the working set, and access to elements in an array is faster because it is optimized by the runtime.

✔️ CONSIDER using arrays in low-level APIs to minimize memory consumption and maximize performance.

✔️ DO use byte arrays instead of collections of bytes.

❌ DO NOT use arrays for properties if the property would have to return a new array (e.g., a copy of an internal array) every time the property getter is called.

AakashM
5

The thing to understand is C# has real arrays, in the formal computer science sense. This means several things, among them:

  1. Fixed size
  2. Contiguous memory

Many other platforms have array constructs of some kind which are in fact NOT arrays, at least in the formal computer science sense, because they violate one (and usually both) of those features (though they will often paper over the latter). Rather, the platforms offer a collection with the "array" name merely attached to it.

These other platforms are correct to do this, in a sense, because it turns out real computer-science-style arrays are not what we need most of the time; a collection type is almost always far more appropriate. That is, it's extremely common to want to append or remove items, to look up elements by key rather than index, or to have the contents be immutable. None of these things are provided by simple arrays.

Thankfully, C# includes a number of collection types as well to fill in these gaps and more, and we can use types such as List<T>, Dictionary<TKey, TValue>, and many others. It's worth noting these collection types tend to also closely relate to computer science concepts of their own, though the naming is not always perfect (it's possible List<T> should have been named Vector<T>, for example).

This should be enough now to understand why the practice you observed for C# is the way it is, or at least why array is not commonly chosen.


This collection vs array design should be considered a strength for C#, rather than a weakness.

Formally specifying the various collections gives the programmer the power and guidance to use the collection type that is actually appropriate to the situation, with less encouragement to fall back to a baseline catch-all array type. Additionally, it gives the programmer power to use a real array when appropriate, which after all exists for a reason and can have certain nice performance wins when working with truly low-level code, such as interoperating with low-level operating system or network constructs.


> Why then do so many developers (and API designers) seem to prefer using List<T> in returned data sets?

I think I have explained why we don't use array, but in my opinion you are right to question this: the choice of List<T> in these specific scenarios is also incorrect and has led to a lot of inefficient C# data access code over the years.

Specifically, this choice creates a tendency for code to call .ToList() when it would not otherwise be necessary, or to manually create a list instance to return and add each record. It forces us to fully materialize data result sets when we might otherwise be able to limit memory use to one record at a time.

In most cases, we should be using IEnumerable<T> instead of List<T>. IQueryable<T> and (more recently) IAsyncEnumerable<T> are also good options. In any case, from here we either skip the .ToList() call or use a yield iterator instead of instantiating a list and adding records. Again, this allows us to set up data processing systems that stream data one record at a time, rather than materializing entire result sets as usually happens today. This could dramatically improve memory use (for I hope obvious reasons) and initial response times (because we can start streaming data as we receive the first record, instead of waiting for the last record to be added to a list).
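As a rough illustration of the IAsyncEnumerable<T> option, the sketch below uses a hypothetical ReadOrderIdsAsync method; the Task.Delay call only simulates waiting for the next row and is not how a real data reader would be awaited.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical async stream: each id becomes available as soon as its "row" arrives.
async IAsyncEnumerable<int> ReadOrderIdsAsync()
{
    for (int id = 1; id <= 3; id++)
    {
        await Task.Delay(10);   // stand-in for awaiting the next row from the data source
        yield return id;
    }
}

int total = 0;
await foreach (int id in ReadOrderIdsAsync())
{
    total += id;                // processing starts before the last row exists
}

Console.WriteLine(total); // 6
```

No list is built at any point; only one element is held by the consumer at a time.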


This also displays the weakness of my argument that individual collection types are a strength: the various types are only a strength to the degree programmers understand them and make good choices. In practice, we often still fall back to a baseline List<T> collection, whether or not it's really the right option.

But as one final counter-point, this is still no worse than what happens on other platforms, and at least makes it a little easier to do the right thing now and then.

2

If an array that doesn't change size is all you need, there is absolutely nothing wrong with using a plain array.

If you need to grow and shrink your data dynamically though, it certainly makes sense to use List<> over rolling your own abstraction layer over arrays.

If you are writing a library for other people to use, you don't know what they are doing with the results, or whether they need that abstraction. It might be good to use such a layer for convenience and general acceptance of your library.

But if you are your own end user and you know you don't need it... just don't use it. Nothing wrong with knowing what you need and using exactly that and nothing more.

nvoigt
0

Efficiency

Serialization formats commonly do not indicate up front how many items will follow. Take JSON, for example: after the opening [, there is no indication of how many items the array will contain. But if the data is returned as an array, you really expect its Length property to be correct: no missing data, no extraneous null or empty slots.

The only ways to create an "exact" array in this case are to start with an empty array and resize it for every new element retrieved (which is enormously inefficient), or to use a bigger array and trim the unused slots at the end (which is inefficient for the final, big array, as it must be copied once more).

Using List<> avoids recreating all the functionality of resizing, and also avoids the last step of duplicating the array and copying the data into an array with the correct size.

Type errors

Another thing with C# arrays is that you cannot always rely on the type system. This comes from a very unfortunate envy of Java, which has the same very unfortunate design. You can always count on a List<object> accepting any object, but this is not true for arrays:

List<object> lst = DecodeList();
lst.Add( "" );  // Ok
lst.Add( 42 );  // Ok

object[] arr;

arr = DecodeArrayUri();
arr[0] = new Uri("https://example.com");  // Ok
arr[0] = "";  // Will throw ArrayTypeMismatchException

arr = DecodeArrayStr();
arr[0] = 42;  // Will throw ArrayTypeMismatchException
arr[0] = "";  // Ok

// Because (array covariance applies only to reference types, so an
// int[] cannot be returned as object[]; Uri[] is used here instead)

private object[] DecodeArrayUri() { return new Uri[1]; }

private object[] DecodeArrayStr() { return new string[1]; }

The error above also occurs with derived classes. Think of your surprise when you get a runtime error because a Base[] array refuses a Derived instance. Assigning a Giraffe and a Turtle to elements of an Animal[] may work for both, for one but not the other, or for neither. You cannot possibly know.

From this, I would argue that arrays are not simple at all. On the contrary. Also, because this is a runtime error, this is also a runtime cost, that makes arrays not only not simple, but also slow.

-1

I think you are mistaking something else for an aversion to arrays. When returning a value from an API, you have to choose how expressive that result will be. Even with something as simple as true/false, you may choose to use an enum, which is more expressive, or restrict it to just true/false and require the consumer to know and keep track of what that value means as it is passed around.

Returning a List is the more expressive approach, IEnumerable is the more restrictive approach.

What your question points out is that the main “advantage” of an array is in communicating that it’s complete. And (a) that is better conveyed via an IEnumerable and (b) it is an assumption on the part of the API. Adding additional items to the result of an API call happens all the time.

The real advantage of an array is not in what it does, but in an implementation detail and what it allows: arrays are fixed size and the memory layout is contiguous, allowing the contents to be copied faster than any other structure. But this isn’t something that is guaranteed by the definition of an array in .NET. Microsoft (or anyone else implementing the runtime) could create a better or worse implementation and the only thing that would break would be things that used Unsafe.

There isn’t an aversion to arrays, they just don’t offer any real benefits.

jmoreno