Commit 32f1d87

committed
2 parents aac35d6 + 142305c commit 32f1d87

5 files changed

Lines changed: 109 additions & 6 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,13 @@
1+
# Dev
2+
3+
## New features
4+
5+
* `byrow(ds, t::DataType, col)` converts the values of `col` to type `t`.
6+
7+
## Fixes
8+
9+
* Fix an issue in `flatten/!` affecting columns of type `Any`.
10+
111
# Version 0.7.3
212

313
## New features

docs/src/man/basics.md

Lines changed: 87 additions & 2 deletions
@@ -44,7 +44,7 @@ The first line of the output provides the general information about the data set
4444
The following example shows how to create a data set by providing a range of values.
4545

4646
```jldoctest
47-
julia> Dataset(A=1:3, B=5:7, fixed=1)
47+
julia> Dataset(A = 1:3, B = 5:7, fixed = 1)
4848
3×3 Dataset
4949
Row │ A B fixed
5050
│ identity identity identity
@@ -99,7 +99,7 @@ julia> Dataset([1 0; 2 0], :auto)
9999
1 │ 1 0
100100
2 │ 2 0
101101
102-
julia> Dataset([[1 ,2], [0, 0]], :auto)
102+
julia> Dataset([[1, 2], [0, 0]], :auto)
103103
2×2 Dataset
104104
Row │ x1 x2
105105
│ identity identity
@@ -369,6 +369,91 @@ julia> insertcols!(ds, :var3 => [3.5, 4.6, 32.0])
369369
3 │ 3 val3 32.0
370370
```
371371

372+
### Converting the columns' type
373+
374+
To convert the values of a column to another type, use the following syntax:
375+
376+
`modify!(ds, col => byrow(T))`
377+
378+
where `ds` is the input data set, `col` is the column whose values are to be converted, and `T` is the new type (the `byrow` function is discussed in [Row-wise operations](https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/byrow/), and the `modify!` function is discussed in [Transforming datasets](https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/modify/#Transforming-data-sets)). Use this form when each value can be converted individually. When the conversion needs information about all the values in a column, drop `byrow`, e.g. `modify!(ds, col => PooledArray)`. Additionally, you can let Julia find the most suitable type for a column by calling `modify!(ds, col => byrow(identity))`. The following example uses `modify!` to correct the types of the columns in `ds`.
379+
380+
> Note that in the following example, calling `byrow(identity)` on `:y` converts its type from `Any` to `Integer`. However, `Integer` is an abstract type, and it slows down operations on `ds`. To improve performance, use `modify!(ds, :y => byrow(Int))` instead.
381+
382+
```jldoctest
383+
julia> using PooledArrays
384+
385+
julia> ds = Dataset(x = [missing,2,3,4], y = Any[1,missing,-1,true], z = ["a", "bc", "a", missing])
386+
4×3 Dataset
387+
Row │ x y z
388+
│ identity identity identity
389+
│ Int64? Any String?
390+
─────┼──────────────────────────────
391+
1 │ missing 1 a
392+
2 │ 2 missing bc
393+
3 │ 3 -1 a
394+
4 │ 4 true missing
395+
396+
julia> modify!(ds, :x => byrow(Float64), :y => byrow(identity), :z => PooledArray)
397+
4×3 Dataset
398+
Row │ x y z
399+
│ identity identity identity
400+
│ Float64? Integer? String?
401+
─────┼───────────────────────────────
402+
1 │ missing 1 a
403+
2 │ 2.0 missing bc
404+
3 │ 3.0 -1 a
405+
4 │ 4.0 true missing
406+
407+
julia> ds[:, :x]
408+
4-element Vector{Union{Missing, Float64}}:
409+
missing
410+
2.0
411+
3.0
412+
4.0
413+
414+
julia> ds[:, :y]
415+
4-element Vector{Union{Missing, Integer}}:
416+
1
417+
missing
418+
-1
419+
true
420+
421+
julia> ds[:, :z]
422+
4-element PooledVector{Union{Missing, String}, UInt32, Vector{UInt32}}:
423+
"a"
424+
"bc"
425+
"a"
426+
missing
427+
```
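The `Integer?` column type in the output above can be reproduced in plain Julia. According to the `src/byrow/byrow.jl` change in this commit, `byrow(ds, identity, col)` broadcasts `identity` over the stored column, so the sketch below does the same on an ordinary vector. It is an illustration that assumes only Base; the variable names are made up for the example and are not part of the package.

```julia
# Plain-Julia sketch (no InMemoryDatasets needed); `y` mirrors column :y above.
y = Any[1, missing, -1, true]

# Broadcasting `identity` rebuilds the vector with the narrowest common
# element type of its values: Int, Bool and Missing join to
# Union{Missing, Integer}.
narrowed = identity.(y)
println(eltype(narrowed))

# `Integer` is abstract, so the non-missing part of the element type
# is not a concrete type.
println(isconcretetype(nonmissingtype(eltype(narrowed))))

# Converting to a concrete target instead stores `true` as the integer 1.
concrete = convert(Vector{Union{Missing, Int}}, y)
println(eltype(concrete))
println(isconcretetype(nonmissingtype(eltype(concrete))))
```

The second vector has a concrete non-missing element type, which is why the note recommends `byrow(Int)` over `byrow(identity)` for this column.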
428+
429+
To convert the type of multiple columns at once, use broadcasting:
430+
431+
```jldoctest
432+
julia> using PooledArrays
433+
434+
julia> ds = Dataset(x = [missing,2,3,4], y = Any[1,missing,-1,true], z = ["a", "bc", "a", missing])
435+
4×3 Dataset
436+
Row │ x y z
437+
│ identity identity identity
438+
│ Int64? Any String?
439+
─────┼──────────────────────────────
440+
1 │ missing 1 a
441+
2 │ 2 missing bc
442+
3 │ 3 -1 a
443+
4 │ 4 true missing
444+
445+
julia> modify!(ds, [:x, :y] .=> byrow(Float64)) # note "." in ".=>"
446+
4×3 Dataset
447+
Row │ x y z
448+
│ identity identity identity
449+
│ Float64? Float64? String?
450+
─────┼────────────────────────────────
451+
1 │ missing 1.0 a
452+
2 │ 2.0 missing bc
453+
3 │ 3.0 -1.0 a
454+
4 │ 4.0 1.0 missing
455+
```
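The `.=>` in the call above is ordinary Julia broadcasting of the pair operator, and it can be inspected without the package. In the sketch below, `sqrt` and `Float64` stand in for `byrow(Float64)`; this is an assumption made so that the example runs with Base alone, and it says nothing about how `modify!` interprets the resulting pairs.

```julia
# `.=>` broadcasts the Pair constructor over the column names; the
# right-hand side (a function or a type) is treated as a scalar.
pairs_fn = [:x, :y] .=> sqrt
pairs_ty = [:x, :y] .=> Float64

# Each column name gets its own pair.
println(pairs_fn == [:x => sqrt, :y => sqrt])
println(pairs_ty == [:x => Float64, :y => Float64])

# Without the dot, a single pair is built whose key is the whole vector.
single = [:x, :y] => sqrt
println(single.first)
```

The dotted form therefore yields one `col => operation` pair per column, which is the shape used in the single-column examples earlier in this section.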
456+
372457
### Some useful functions
373458

374459
The following functions are handy when working with a data set; for more information, see the package documentation. Note that functions whose names end with `!` modify the original data set.

src/byrow/byrow.jl

Lines changed: 7 additions & 3 deletions
@@ -194,9 +194,6 @@ byrow(ds::AbstractDataset, ::typeof(join), col::MultiColumnIndex; threads = nrow
194194

195195
byrow(ds::AbstractDataset, ::typeof(mapreduce), cols::MultiColumnIndex = names(ds, Union{Missing, Number}); op = .+, f = identity, init = _missings(mapreduce(eltype, promote_type, view(_columns(ds),index(ds)[cols])), nrow(ds)), kwargs...) = mapreduce(f, op, eachcol(ds[!, cols]), init = init; kwargs...)
196196

197-
# specific path for converting Any to suitable type
198-
byrow(ds::AbstractDataset, ::typeof(identity), col::ColumnIndex) = identity.(_columns(ds)[index(ds)[col]])
199-
200197

201198
function byrow(ds::AbstractDataset, f::Function, cols::MultiColumnIndex; threads = nrow(ds)>1000)
202199
colsidx = multiple_getindex(index(ds), cols)
@@ -223,3 +220,10 @@ function byrow(ds::AbstractDataset, f::Function, col::ColumnIndex; threads = nro
223220
end
224221
res
225222
end
223+
224+
225+
# specific path for converting Any to suitable type
226+
byrow(ds::AbstractDataset, ::typeof(identity), col::ColumnIndex) = identity.(_columns(ds)[index(ds)[col]])
227+
228+
# special case for converting the type of a column conveniently
229+
byrow(ds::AbstractDataset, f::Type, col::ColumnIndex) = convert(Vector{Union{Missing, f}}, _columns(ds)[index(ds)[col]])
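Because the new method delegates to `convert`, some of its consequences can be explored in plain Julia. The following is a minimal sketch that assumes only Base; `to_type` is a hypothetical stand-in for the right-hand side of the method and is not part of the package.

```julia
# Hypothetical helper mirroring the right-hand side of the new method.
to_type(T::Type, v::AbstractVector) = convert(Vector{Union{Missing, T}}, v)

x = Union{Missing, Int}[missing, 2, 3, 4]

# `missing` stays `missing`; the other values become Float64.
xf = to_type(Float64, x)
println(xf)

# `convert` refuses lossy conversions: 1.5 has no exact Int representation.
err = try
    to_type(Int, [1.5, 2.0])
    nothing
catch e
    e
end
println(err isa InexactError)

# When the vector already has the target type, `convert` returns the
# same object rather than a copy.
println(to_type(Float64, xf) === xf)
```

Two points follow from the sketch, under the assumption that the package method behaves like its one-line definition: values that cannot be converted exactly raise an error instead of being rounded, and converting a column to the type it already has may return the existing vector.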

src/byrow/doc.jl

Lines changed: 4 additions & 0 deletions
@@ -98,6 +98,8 @@ byrow_docs_text = """
9898
Perform a row-wise operation specified by `fun` on selected columns `cols`. Generally,
9999
`fun` can be any function that returns a scalar value for each row.
100100
101+
> A type can be passed as `fun` when `cols` refers to a single column. In this case, `byrow` simply converts the selected column to a vector with element type `Union{Missing, fun}`.
102+
101103
`byrow` is fine tuned for the following operations. To get extra help for each of them search help for `byrow(fun)`, e.g. `?byrow(sum)`;
102104
103105
# Reduction operations
@@ -1234,6 +1236,8 @@ Variant of `byrow(stdze!)` which pass a copy of `ds` and leave `ds` untouched.
12341236
12351237
Return the result of calling `fun` on each row of `ds` selected by `cols`. The `fun` function must accept one argument which contains the values of each row as a vector of values and return a scalar.
12361238
1239+
When a type is passed as `fun` and a single column as `cols`, `byrow` converts the corresponding column to the type specified by `fun`.
1240+
12371241
For generic functions there are two special cases:
12381242
12391243
* When `cols` is a single column, `byrow(ds, fun, cols)` acts like `fun.(ds[:, cols])`

src/dataset/modify.jl

Lines changed: 1 addition & 1 deletion
@@ -528,7 +528,7 @@ function _modify_f_barrier(ds, msfirst, mssecond, mslast)
528528
end
529529
catch e
530530
if e isa MethodError
531-
throw(ArgumentError("There is problem in your `byrow`, make sure that the output of `byrow` is a vector"))
531+
throw(ArgumentError("There might be a problem in the `byrow` usage, make sure that the output of `byrow` is a vector"))
532532
end
533533
rethrow(e)
534534
end
