LuminousmenBlog
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
A long review of how TikTok uses machine learning to increase user engagement and pierce the "filter bubble". Nothing really fancy, just an interesting read about the problems they are trying to solve.

https://www.axios.com/inside-tiktoks-killer-algorithm-52454fb2-6bab-405d-a407-31954ac1cf16.html
AWS has released Bottlerocket, a Linux-based operating system. It is an open-source project developed by AWS as a minimal host for running containers. The general idea is that today containers are mostly run on general-purpose operating systems, which helps neither security nor the ability to do atomic updates.

https://aws.amazon.com/blogs/opensource/announcing-the-general-availability-of-bottlerocket-an-open-source-linux-distribution-purpose-built-to-run-containers/
I recommended Pi-hole not long ago, but an RCE exploit has been discovered in the Pi-hole software. This particular problem requires authenticated access to the Pi-hole administrative web interface, so it's not likely to cause too many problems on its own, but still.

https://frichetten.com/blog/cve-2020-11108-pihole-rce/

#privacy
Famous in-memory data format

Apache Arrow is a holy grail of analytics that was invented not so long ago. It is a special format for columnar data storage in memory. It allows you to copy objects from one process to another very quickly: from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another node, etc. This makes it the workhorse of a large number of frameworks for both analytics and big data.

I actually don't know any other in-memory format that combines complex data, dynamic schemas, performance, and this level of platform support.

Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of systems:

- SQL execution engines (Drill, Impala, etc.)
- Data analysis systems (Pandas, Spark, etc.)
- Streaming and queueing systems (Kafka, Storm, etc.)
- Storage systems (Parquet, Kudu, Cassandra, etc.)
- Machine learning libraries (TensorFlow, Petastorm, Rapids, etc.)

Please do not think that this is part of the Parquet format or part of PySpark. It is a separate, self-contained format which I think is a bit undervalued and should be taught alongside all the other big data formats.
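A minimal sketch of the round trip, assuming pyarrow and pandas are installed (the data and file name are made up for illustration):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

# pandas -> Arrow: the data is laid out column by column in memory
table = pa.Table.from_pandas(df)
print(table.schema)

# the same in-memory table can be handed back to pandas (or other tools) cheaply
print(table.to_pandas())

# and it maps naturally onto the on-disk Parquet format
pq.write_table(table, "example.parquet")
```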

https://arrow.apache.org/overview/

#big_data
Where do I start to learn AWS?

So, if you go to the AWS Documentation you will see an endless list of services, but it's just the global table of contents of global tables of contents! That's right: Amazon is huge right now. At the time of writing there are about two hundred and fifty services under the hood. It is not realistic to learn them all, and there is no reason to do so anyway.

John Markoff says "The Internet is entering its Lego era." AWS services are similar to Lego: you find the right pieces and combine them. A reasonable way to pick out the most essential pieces is to take the ones that came first historically. They are:

- S3 — storage
- EC2 — virtual machines + EBS drives
- RDS — databases
- Route53 — DNS
- VPC — network
- ELB — load balancers
- CloudFront — CDN
- SQS/SNS — messages
- IAM — main access rights to everything
- CloudWatch — logs/metrics

Then there are modern serverless pieces (Lambda, DynamoDB, API Gateway, CloudFront, IAM, SNS, SQS, Step Functions, EventBridge).
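To make the Lego metaphor concrete, here is a minimal sketch of combining two of those pieces (S3 and SQS) with boto3; the bucket name and queue URL are placeholders, and credentials are assumed to come from your environment:

```python
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# store an object in S3...
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2020-10.json",
    Body=json.dumps({"status": "ok"}),
)

# ...and notify a downstream consumer through SQS
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-example-queue",
    MessageBody="reports/2020-10.json",
)
```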

#aws
Rapids

Nvidia has been developing Rapids, an open-source platform whose goal is to accelerate data processing and machine learning algorithms on the GPU. Developers using Rapids don't have to juggle different libraries: they just write Python code, and Rapids optimizes it to run on the GPU. All data is stored in memory in the Apache Arrow format.

I already wrote about GPU vs CPU. But the problem is that the memory available to a CPU these days goes up to terabytes, while a GPU tops out at around 50 GB. Here Dask comes to the rescue: integration with Dask gives Rapids GPU clusters with multi-GPU support.

The Rapids repositories include cuDF, a pandas-like library for data preparation, and cuML, which allows you to develop machine learning algorithms without going into the details of CUDA programming.
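A minimal sketch of what cuDF looks like in practice, assuming a machine with an Nvidia GPU and cuDF installed (the data is made up):

```python
import cudf

# the API mirrors pandas, but the data lives in GPU memory (Arrow layout)
gdf = cudf.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "amount": [10.0, 5.5, 3.2, 8.0, 1.1],
})

# the groupby/aggregation runs on the GPU
totals = gdf.groupby("user_id")["amount"].sum()

# copy the result back to host memory as a regular pandas object
print(totals.to_pandas())
```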

Sounds cool, doesn't it? But there is always a "but":
- it's still not production-ready
- porting any complex UDF is very hard (at the very least you need to know CUDA, which I don't)
- there is no CPU version of the libraries for inference
- no automatic memory management
- it's Nvidia-only

https://github.com/rapidsai

#ml
MLOps

Our ML algorithms are fine, but good results require a sizable team of data specialists, data engineers, domain experts, and other support staff. And as if the number and cost of expert staff were not constraining enough, our understanding of how to optimize the number of nodes, layers, and hyperparameters is still primitive. Finally, moving models into production and keeping them up to date is the last hurdle, given that the predictions a model produces can often only be served by continuing to run the same expensive and complex architecture that was used for training. It should be understood that moving to production is a process, not a step, and it starts long before model development. Its first step is to define the business objective, the hypothesis about the value that can be extracted from the data, and the business ideas for applying it.

MLOps is a combination of machine learning technologies and processes with approaches for putting the developed models into business processes. The concept itself emerged as an analogy to DevOps applied to ML models and ML approaches. DevOps is an approach to software development that increases the speed of shipping individual changes while maintaining flexibility and reliability through a number of practices: continuous development, splitting functionality into independent microservices, automated testing and deployment of individual changes, global performance monitoring, a system for responding promptly to detected failures, and so on.

MLOps, or DevOps for machine learning, allows data science and IT teams to collaborate and accelerate model development and implementation by monitoring, validating, and managing machine learning models.

Of course, there is nothing new here: everyone has been doing it one way or another for a while. It is just that now there is a hype word for it, behind which there are usually ready-made solutions like Seldon, Kubeflow, or MLflow.
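For a sense of what the tooling side looks like, here is a minimal sketch of experiment tracking with MLflow, one of the tools mentioned above (the parameter and metric values are made up for illustration):

```python
import mlflow

with mlflow.start_run(run_name="baseline"):
    # record what the model was trained with...
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    # ...and how it performed, so runs can be compared and reproduced later
    mlflow.log_metric("rmse", 0.27)
```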

#ml
Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. The framework's creators promote it actively: they say it is kind of cool and they also promise SOTA results in NLP. I haven't tried it myself, but it would be interesting to compare the claims with its real capabilities. So far it looks promising.
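A minimal sketch of the intended API, based on the pretrained pipelines from their docs (the pipeline name and output keys may differ between versions, and I haven't benchmarked any of this myself):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# starts a SparkSession with the Spark NLP jars attached
spark = sparknlp.start()

# downloads a ready-made pipeline (tokenizer, POS tagger, NER, ...) on first use
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("John Snow Labs built Spark NLP on top of Apache Spark.")
print(result.keys())           # the annotation types produced by the pipeline
print(result.get("entities"))  # named entities, if the pipeline includes NER
```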

https://github.com/JohnSnowLabs/spark-nlp

#spark #ml
In short, I wrote a book to add another title to my name: author. I got a lot of experience out of it and not much else. I will be glad if you recommend it, read it, or write a review on it.

https://www.amazon.com/dp/B08KG1DNRD/ref=cm_sw_r_cp_awdb_t1_K-rDFbH218AY4
Hit the like button if you want to know more about the tech writing topic. Even though I did everything myself, I did dive a little deeper into it.
A friend asked me an interesting question about what skills are worth learning for Data Management specialists and how to build a growth roadmap. The question actually made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm just speculating about the current state and future of Data Management.

https://luminousmen.com/post/data-management-skills
Give me 10 seconds to explain what ML engineers are doing